Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (9): 187-194.DOI: 10.3778/j.issn.1002-8331.2012-0264

• Pattern Recognition and Artificial Intelligence •

DFSMN-T: Mandarin Speech Recognition with a Transformer Language Model

HU Zhangfang, JIAN Fang, TANG Shanshan, MING Ziping, JIANG Bowen   

  1. School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Online: 2022-05-01  Published: 2022-05-01




Abstract: An automatic speech recognition system consists of two parts: an acoustic model and a language model. However, the traditional N-gram language model ignores the semantic similarity between words and has an excessively large parameter count, which limits further reduction of the character error rate of speech recognition. To address these problems, this paper proposes a novel speech recognition system that uses Chinese syllables (pinyin) as intermediate characters: a deep feed-forward sequential memory network (DFSMN) serves as the acoustic model and performs the speech-to-syllable task, while the pinyin-to-Chinese-character conversion is treated as a translation task, for which a Transformer is introduced as the language model. A simple method is also proposed to reduce the computational complexity of the Transformer: when computing attention weights, a Hadamard matrix is introduced for filtering and parameters below a threshold are discarded, making model decoding faster. Experiments on the Aishell-1 and Thchs30 datasets show that, compared with DFSMN combined with a 3-gram model, the speech recognition system based on DFSMN and the improved Transformer achieves a relative reduction of 3.2% in character error rate on the best model, reaching a character error rate of 11.8%; compared with a BLSTM-based speech recognition system, its character error rate is relatively reduced by 7.1%.

Key words: speech recognition, deep feedforward sequential memory networks(DFSMN), Transformer, Chinese syllables, Hadamard matrix
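The abstract's complexity-reduction idea couples a Hadamard-matrix filter with threshold-based discarding of attention weights; the exact form of the filter is not given in the abstract, so the sketch below (plain NumPy, all names hypothetical) illustrates only the thresholding component: attention weights below a cutoff are zeroed and the surviving weights are renormalized before the weighted sum over the values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def thresholded_attention(Q, K, V, tau=0.05):
    """Scaled dot-product attention with threshold-based sparsification.

    Attention weights below tau are discarded (set to zero) and the
    remaining weights are renormalized, so the weighted sum over V
    effectively touches fewer rows. tau must be small enough that each
    query keeps at least one weight, otherwise renormalization divides
    by zero.
    """
    d = Q.shape[-1]
    w = softmax(Q @ K.T / np.sqrt(d))      # dense attention weights
    w = np.where(w < tau, 0.0, w)          # discard small weights
    w = w / w.sum(axis=-1, keepdims=True)  # renormalize the survivors
    return w @ V
```

With identity queries and keys and a sufficiently large threshold, each query keeps only its own key, so the output reduces to the value matrix itself, which makes the sparsification easy to sanity-check.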
