Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (7): 185-191. DOI: 10.3778/j.issn.1002-8331.2108-0265

• Pattern Recognition and Artificial Intelligence •

Research on Speech Recognition Based on Residual Network and Gated Convolution Network

ZHU Xuechao, ZHANG Fei, GAO Lu, REN Xiaoying, HAO Bin   

  1. School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014000, China
  • Online: 2022-04-01  Published: 2022-04-01

Abstract: The traditional recurrent neural network has a complex structure, needs a large amount of data to be trained correctly for continuous speech recognition, and its training is time-consuming and places heavy demands on hardware. To address these problems, an algorithm based on a residual network and a gated convolutional neural network is proposed and combined with the connectionist temporal classification (CTC) algorithm to build an end-to-end Chinese speech recognition model. The model takes the spectrogram as input, extracts high-level abstract features through the residual network, and then captures effective long-term dependencies through stacked gated convolutional layers, removing the traditional recurrent network's reliance on sequential context modeling and speeding up training. The residual network is further optimized, and a feedforward neural network is added to the gated convolutional network, which greatly improves the performance of the model. Experimental results show that on the Aishell-1 Chinese dataset the word error rate of the model is reduced to 11.43%, and in a low signal-to-noise-ratio environment of -5 dB the word error rate is 19.77%.
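
The abstract gives no implementation details beyond this architecture description, so the following is only a minimal sketch, assuming a PyTorch implementation, of the gated convolutional part: a gated (GLU-style) 1-D convolution over the time axis, a Swish (SiLU) activation inside an added position-wise feed-forward sub-layer, and residual connections. The residual front-end over the spectrogram is omitted, and all names and sizes (GatedConvBlock, ffn_dim, kernel_size) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed PyTorch implementation, not the authors' code) of a
# gated convolutional block with a Swish activation and a feed-forward sub-layer.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, ffn_dim: int = 512):
        super().__init__()
        # 1-D convolution over time producing twice the channels:
        # one half is the candidate output, the other half the gate (GLU).
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)
        # Position-wise feed-forward sub-layer added after the gated convolution.
        self.ffn = nn.Sequential(
            nn.Linear(channels, ffn_dim),
            nn.SiLU(),                      # SiLU == Swish: x * sigmoid(x)
            nn.Linear(ffn_dim, channels),
        )
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, 2*channels)
        a, b = y.chunk(2, dim=-1)
        y = a * torch.sigmoid(b)             # gated linear unit
        y = x + y                            # residual connection
        return self.norm(y + self.ffn(y))    # feed-forward sub-layer + residual
```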

Key words: residual network, gated convolutional neural network, connectionist temporal classification, Swish activation function
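
As a usage note on the connectionist temporal classification keyword, the sketch below shows how per-frame character logits from such a model could be trained with PyTorch's torch.nn.CTCLoss. The vocabulary size, sequence lengths, and blank index are assumptions for illustration only, not values from the paper.

```python
# Illustrative CTC training step (assumed PyTorch setup, not the authors' code).
import torch
import torch.nn as nn

vocab_size = 4233          # e.g. Chinese characters + blank (assumed size)
batch, frames, max_label = 4, 200, 30

# Stand-in for the model's per-frame character logits.
logits = torch.randn(batch, frames, vocab_size, requires_grad=True)
log_probs = logits.log_softmax(-1).transpose(0, 1)       # (T, N, C) as CTCLoss expects
targets = torch.randint(1, vocab_size, (batch, max_label), dtype=torch.long)
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.randint(10, max_label, (batch,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```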