Research on Speech Recognition Based on Residual Network and Gated Convolution Network

doi:10.3778/j.issn.1002-8331.2108-0265

Abstract

Abstract: Traditional recurrent neural networks have a complex structure, so continuous speech recognition with them needs a large amount of training data, long training times, and high-performance hardware. To address these problems, an algorithm based on a residual network and a gated convolutional neural network is proposed and combined with connectionist temporal classification (CTC) to build an end-to-end Chinese speech recognition model. The model takes the spectrogram as input, extracts high-level abstract features with the residual network, and then captures effective long-term memory with stacked gated convolutional layers. This removes the reliance on traditional recurrent neural networks for modeling contextual dependencies and speeds up training. In addition, the residual network is optimized and a feedforward neural network is added to the gated convolutional neural network, which greatly improves model performance. Experimental results show that on the Aishell-1 Chinese data set the model's character error rate is reduced to 11.43%, and in a low signal-to-noise-ratio environment of -5 dB the character error rate is 19.77%.
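
The gating idea named in the abstract can be illustrated with a small sketch. The abstract does not give the exact layer definition, so the code below assumes the commonly used gated linear unit form, in which one convolution output is multiplied elementwise by the sigmoid of a second convolution output, and the standard Swish definition x · sigmoid(x). The single-channel scalar sequence, the kernel values, and the function names (`conv1d`, `gated_conv1d`) are hypothetical simplifications for illustration, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x: float) -> float:
    """Swish activation, assumed here in its basic form x * sigmoid(x)."""
    return x * sigmoid(x)


def conv1d(seq, kernel, bias=0.0):
    """Unpadded 1-D convolution (cross-correlation) over a scalar sequence."""
    k = len(kernel)
    return [
        sum(w * seq[t + i] for i, w in enumerate(kernel)) + bias
        for t in range(len(seq) - k + 1)
    ]


def gated_conv1d(seq, w, v, b=0.0, c=0.0):
    """Assumed gated linear unit: conv(seq, w) * sigmoid(conv(seq, v))."""
    linear = conv1d(seq, w, b)  # content path
    gate = conv1d(seq, v, c)    # gate path, squashed to (0, 1) below
    return [a * sigmoid(g) for a, g in zip(linear, gate)]


# Toy stand-in for one feature channel over five time steps (made-up values).
frames = [0.5, -1.0, 2.0, 0.0, 1.5]

# With an all-zero gate kernel, every gate equals sigmoid(0) = 0.5,
# so each output is exactly half of the content-path value.
out = gated_conv1d(frames, w=[1.0, 0.0, -1.0], v=[0.0, 0.0, 0.0])
print(out)
print(swish(0.0), swish(1.0))
```

Each output position depends only on a fixed window of inputs, so all positions can be computed in parallel. Stacking such layers widens the context each output sees, which is one way a convolutional model can cover long-range context without recurrence.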

Key words: residual network, gated convolutional neural network, connectionist temporal classification, Swish activation function

ZHU Xuechao, ZHANG Fei, GAO Lu, REN Xiaoying, HAO Bin. Research on Speech Recognition Based on Residual Network and Gated Convolution Network[J]. Computer Engineering and Applications, 2022, 58(7): 185-191.

朱学超, 张飞, 高鹭, 任晓颖, 郝斌. 基于残差网络和门控卷积网络的语音识别研究[J]. 计算机工程与应用, 2022, 58(7): 185-191.
