Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (18): 124-130. DOI: 10.3778/j.issn.1002-8331.1907-0019

• Pattern Recognition and Artificial Intelligence •

End-to-End Speech Recognition Based on ResNet-BLSTM

HU Zhangfang (胡章芳), XU Xuan (徐轩), FU Yaqin (付亚芹), XIA Zhiguang (夏志广), MA Sudong (马苏东)

  1. School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. School of Advanced Manufacturing Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Online: 2020-09-15 Published: 2020-09-10

Abstract:

In end-to-end speech recognition models based on deep learning, the model input uses fixed-length speech frames. This loses time-domain information and part of the high-frequency information, which leads to a low recognition rate and poor robustness. To address these problems, this paper proposes a model that combines a residual network (ResNet) with a bidirectional long short-term memory network (BLSTM). The model takes the spectrogram as input and adds parallel convolutional layers to the residual network to extract features at different scales, which are then fused. Finally, connectionist temporal classification (CTC) is used for classification, yielding an end-to-end speech recognition model. Experimental results show that, compared with a traditional end-to-end model, the character error rate (reported as WER) of the proposed model on the Aishell-1 speech corpus decreases by 2.52%, and its robustness is better.

Key words: residual network (ResNet), bidirectional long short-term memory (BLSTM), parallel convolutional layer, connectionist temporal classification (CTC)
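
The abstract names two pieces that are easy to illustrate in isolation: the CTC output stage and the error-rate metric. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes greedy (best-path) decoding, which the abstract does not specify, and it assumes the error rate is the edit distance divided by the reference length, computed over characters for Mandarin text. The function names `ctc_greedy_decode`, `edit_distance`, and `error_rate`, and the choice of label `0` for the CTC blank, are illustrative assumptions.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence into an output sequence.

    Standard CTC collapsing rule: merge consecutive repeated labels,
    then drop the blank symbol. `blank=0` is an assumed convention.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_token != hyp_token)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])` keeps both occurrences of label `1` because a blank separates them, and `error_rate("今天天气好", "今天气好")` counts one deleted character against a five-character reference.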