Review of End-to-End Streaming Speech Recognition

doi:10.3778/j.issn.1002-8331.2206-0306

Abstract

Speech recognition is an important means of human-computer interaction and a foundational step in natural language processing. With the development of artificial intelligence, many application scenarios, human-computer interaction among them, require streaming speech recognition. Streaming speech recognition outputs results while the speech is still being input, which greatly reduces the processing time of speech recognition in human-computer interaction. End-to-end speech recognition has produced substantial results in academic research, but streaming speech recognition still faces challenges in both academic research and industrial application. As a result, end-to-end streaming speech recognition has become a research hotspot in the speech field over the last two years. This paper surveys and analyzes recent research on end-to-end streaming recognition models and their performance optimization, covering the following: (1) The methods and models for end-to-end streaming speech recognition are analyzed and summarized in detail, including the CTC and RNN-T models, which support streaming recognition directly, and methods such as monotonic attention, which modify the attention mechanism to make streaming recognition possible. (2) Methods for improving the recognition accuracy and reducing the latency of end-to-end streaming speech recognition models are introduced. Accuracy is improved mainly through minimum word error rate training and knowledge distillation, and latency is reduced mainly through alignment and regularization. (3) Commonly used open-source Chinese and English datasets and the performance evaluation criteria for streaming speech recognition models are introduced. (4) The future development and prospects of end-to-end streaming speech recognition models are discussed.

Key words: human-computer interaction, speech recognition, end to end, streaming, delay

WANG Aohui, ZHANG Long, SONG Wenyu, MENG Jie. Review of End-to-End Streaming Speech Recognition[J]. Computer Engineering and Applications, 2023, 59(2): 22-33.
