Hybrid CTC/Attention End-to-End Chinese Speech Recognition Enhanced by Conformer

doi:10.3778/j.issn.1002-8331.2111-0462

Abstract

Abstract: Transformer architectures based on self-attention have recently achieved strong performance on a range of tasks across different fields. This work first evaluates Transformer-LAS, a speech recognition model that pairs a Transformer encoder with a LAS (listen, attend and spell) decoder. Because the Transformer is weak at capturing local information, the Transformer encoder is then replaced with a Conformer, yielding the Conformer-LAS model. Attention-based alignment is overly flexible, which causes recognition performance to degrade sharply in noisy environments; connectionist temporal classification (CTC) is therefore used as an auxiliary training objective to speed up convergence, and an intermediate phoneme-level CTC loss is added for joint optimization, giving the better-performing Conformer-LAS-CTC model. The proposed models are evaluated on the open-source Mandarin Chinese Aishell-1 dataset. Experimental results show that, relative to the BLSTM-LAS and Transformer-LAS baselines, Conformer-LAS-CTC reduces the character error rate (CER) on the test set by 22.58% and 48.76% respectively, reaching a final CER of 4.54%.
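The relative reductions quoted in the abstract can be related to absolute error rates with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the baseline values it prints are back-computed from the stated percentages and the final 4.54% figure, not numbers reported in the abstract.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative error-rate reduction, in percent, of new_cer against baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer * 100.0


def implied_baseline(new_cer: float, reduction_pct: float) -> float:
    """Baseline error rate implied by a final error rate and a relative reduction (percent)."""
    return new_cer / (1.0 - reduction_pct / 100.0)


FINAL_CER = 4.54  # final character error rate (%) stated in the abstract

# The two relative reductions stated in the abstract; the baselines are derived, not reported.
for pct in (22.58, 48.76):
    base = implied_baseline(FINAL_CER, pct)
    print(f"{pct:.2f}% relative reduction implies a baseline of about {base:.2f}%")
# prints about 5.86% for the 22.58% reduction and about 8.86% for the 48.76% reduction
```

A relative reduction is measured against the baseline's own error rate, so a 48.76% relative reduction corresponds to a drop of roughly 4.3 absolute points here, not 48.76 points.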

Key words: end-to-end, speech recognition, Conformer, LAS, connectionist temporal classification
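The abstract describes three training signals optimized jointly: the attention (LAS) decoder loss, an auxiliary CTC loss, and an intermediate phoneme-level CTC loss. The abstract does not give the combining formula, so the following is a minimal sketch under assumptions: the linear interpolation form, the function name `joint_loss`, and the default weights of 0.3 are all hypothetical and chosen only for illustration.

```python
def joint_loss(att_loss: float,
               ctc_loss: float,
               inter_ctc_loss: float,
               ctc_weight: float = 0.3,    # assumed value, not from the paper
               inter_weight: float = 0.3   # assumed value, not from the paper
               ) -> float:
    """Combine attention, CTC and intermediate CTC losses by linear interpolation.

    This is one common way to weight such terms; it is an assumption here,
    not the formula confirmed by the source.
    """
    for name, w in (("ctc_weight", ctc_weight), ("inter_weight", inter_weight)):
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {w}")
    # Blend the final-layer CTC loss with the intermediate (phoneme-level) CTC loss.
    ctc_total = (1.0 - inter_weight) * ctc_loss + inter_weight * inter_ctc_loss
    # Interpolate between the attention-decoder loss and the blended CTC loss.
    return (1.0 - ctc_weight) * att_loss + ctc_weight * ctc_total


# Example with made-up per-batch loss values.
total = joint_loss(att_loss=2.0, ctc_loss=4.0, inter_ctc_loss=6.0)
print(f"{total:.2f}")  # 2.78
```

Setting `ctc_weight` to 0 recovers attention-only training, and setting `inter_weight` to 0 drops the intermediate term, which makes it easy to ablate each auxiliary loss in turn.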

CHEN Ge, XIE Xukang, SUN Jun, CHEN Qidong. Hybrid CTC/Attention End-to-End Chinese Speech Recognition Enhanced by Conformer[J]. Computer Engineering and Applications, 2023, 59(4): 97-103.
