Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (17): 143-149.DOI: 10.3778/j.issn.1002-8331.1805-0486


End-to-End Mandarin Speech Recognition with Improved Convolution Input

WANG Yanzhe, ZHANG Limin, ZHANG Bingqiang, LI Zhenyu   

  1. Institute of Information Fusion, Naval Aviation University, Yantai, Shandong 264000, China
  • Online: 2019-09-01  Published: 2019-08-30


Abstract: The cross-entropy criterion used to train mainstream neural networks optimizes the classification of each frame of acoustic data, whereas continuous speech recognition measures performance by sequence-level transcription accuracy. To address this mismatch, this paper builds an end-to-end speech recognition system trained at the sequence level. To counter the poor performance typical of low-resource corpora, the model processes the input features with a convolutional neural network: the best-performing network structure is selected, and two-dimensional convolution over the time and frequency axes mitigates the effect of small perturbations in the input space caused by varying environments and speakers. The network also applies batch normalization to reduce generalization error and accelerate training. Finally, the hyper-parameters of the decoding process are optimized against a large language model to improve the modeling effect. Experimental results show that system performance improves by about 24%, outperforming mainstream speech recognition systems.
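The front-end described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: a single 2D convolution over a (time, frequency) feature map, followed by batch normalization with gamma=1, beta=0; the feature sizes and the 3×3 kernel are assumptions for illustration.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Naive 2D 'valid' convolution of a (time, freq) feature map
    with a (kh, kw) kernel -- illustrative, not optimized."""
    kh, kw = kernel.shape
    th, tw = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((th, tw))
    for i in range(th):
        for j in range(tw):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def batch_norm(x, eps=1e-5):
    """Normalize activations to zero mean / unit variance, as batch
    normalization does at training time (scale=1, shift=0)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# A toy 10-frame x 8-bin log-filterbank feature map (random stand-in).
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 8))
kernel = rng.normal(size=(3, 3))  # 3x3 time-frequency receptive field

# Convolving in both axes shrinks each dimension by kernel_size - 1.
activations = batch_norm(conv2d_valid(features, kernel))
print(activations.shape)  # (8, 6)
```

Because the kernel spans both axes, each output unit sees a local time-frequency patch, which is what gives the model some invariance to small spectral and temporal shifts between speakers and environments.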

Key words: sequence level, low resource, end-to-end, convolution neural network, batch normalization
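The decoding hyper-parameters mentioned in the abstract are commonly the language-model weight and a length bonus in the standard LM-fused scoring of candidate transcriptions. The abstract does not give the exact objective, so the formula and the names alpha and beta below are assumptions based on that common formulation:

```python
def decode_score(log_p_acoustic, log_p_lm, n_words, alpha=0.8, beta=1.0):
    """Combined score for ranking hypotheses during LM-fused decoding:
    acoustic log-likelihood plus a weighted LM term and a word-insertion
    bonus. alpha and beta are the tunable decoding hyper-parameters."""
    return log_p_acoustic + alpha * log_p_lm + beta * n_words

# Two 4-word candidates for the same utterance (toy log-probabilities):
fluent = decode_score(-12.0, -5.0, 4)   # better LM score
acoustic = decode_score(-11.0, -9.0, 4)  # better acoustic score
print(fluent > acoustic)  # True: the LM term tips the ranking
```

Tuning alpha and beta on a development set trades off acoustic evidence against language-model fluency, which is how a large language model can lift the accuracy of an end-to-end model trained on a low-resource corpus.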
