Computer Engineering and Applications, 2023, Vol. 59, Issue (4): 97-103. DOI: 10.3778/j.issn.1002-8331.2111-0462

• Pattern Recognition and Artificial Intelligence •

Hybrid CTC/Attention End-to-End Chinese Speech Recognition Enhanced by Conformer

CHEN Ge, XIE Xukang, SUN Jun, CHEN Qidong   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2023-02-15 Published: 2023-02-15




Abstract: Transformer architectures based on self-attention have recently performed very well on a range of tasks across different fields. First, this work explores Transformer-LAS, a speech recognition model that combines a Transformer encoder with a LAS (listen, attend and spell) decoder. Because the Transformer is weak at capturing local information, a Conformer-LAS model that replaces the Transformer with a Conformer is proposed for automatic speech recognition. Second, because attention alignment is overly flexible, recognition performance drops sharply in noisy environments. To address this, connectionist temporal classification (CTC) is used as an auxiliary training objective to speed up convergence, and an intermediate CTC loss at the phoneme level is added for joint optimization, yielding an improved Conformer-LAS-CTC model. Finally, the proposed model is evaluated on the open-source Mandarin Chinese Aishell-1 dataset. Experimental results show that, relative to the baseline BLSTM-LAS and Transformer-LAS models, Conformer-LAS-CTC reduces the character error rate on the test set by 22.58% and 48.76%, respectively, reaching a final character error rate of 4.54%.

Key words: end-to-end, speech recognition, Conformer, LAS, connectionist temporal classification
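
The abstract reports results as a character error rate and as relative reductions against baselines. The sketch below shows one common way these two quantities are computed, assuming the usual edit-distance definition of character error rate; the paper's own scoring script is not given here, and the function names and toy strings are hypothetical, chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Total character edits divided by total reference characters."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference transcripts contain no characters")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_chars


def relative_reduction(baseline_cer, new_cer):
    """Relative (not absolute) reduction of new_cer with respect to baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer


# Toy, made-up example: one substituted character in a six-character reference.
cer = character_error_rate(["今天天气很好"], ["今天天汽很好"])
# Toy, made-up error rates: going from 10% to 8% is a 20% relative reduction.
rel = relative_reduction(0.10, 0.08)
```

Under this definition, a relative reduction compares the new error rate with the baseline's own error rate, so the two percentages in the abstract are not absolute differences in character error rate.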
