Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 22-33. DOI: 10.3778/j.issn.1002-8331.2206-0306

• Research Hotspots and Reviews •

Review of End-to-End Streaming Speech Recognition

WANG Aohui, ZHANG Long, SONG Wenyu, MENG Jie   

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
  2. College of Computer Information Engineering, Guangzhou Huali Vocational College of Science and Technology, Guangzhou 511325, China
  • Online: 2023-01-15  Published: 2023-01-15




Abstract: Speech recognition is an important means of human-computer interaction and a foundational step for natural language processing. As artificial intelligence technology develops, many application scenarios, human-computer interaction among them, require streaming speech recognition. Streaming speech recognition outputs recognition results while the speech is still being input, which greatly reduces the processing time of speech recognition during human-computer interaction. End-to-end speech recognition has already produced substantial results in academic research, whereas streaming speech recognition still faces challenges in both academic research and industrial application. As a result, end-to-end streaming speech recognition has become a research hotspot in the speech field over the last two years. This paper surveys and analyzes recent research from the perspectives of end-to-end streaming recognition models and performance optimization, covering the following: (1) The methods and models of end-to-end streaming speech recognition are analyzed and summarized in detail, including the CTC and RNN-T models, which support streaming recognition directly, and methods such as monotonic attention, which modify the attention mechanism to make streaming recognition possible. (2) Methods for improving the recognition accuracy and reducing the delay of end-to-end streaming speech recognition models are introduced. Accuracy is improved mainly through methods such as minimum word error rate training and knowledge distillation, and delay is reduced mainly through methods such as alignment and regularization. (3) Commonly used Chinese and English open-source datasets and the performance evaluation criteria for streaming recognition models are introduced. (4) The future development and prospects of end-to-end streaming speech recognition models are discussed.

Key words: human-computer interaction, speech recognition, end-to-end, streaming, delay
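
The abstract describes streaming recognition as producing output while speech is still arriving, and names CTC as a model that supports this directly. The following is a minimal sketch of that idea, not the method of any paper covered by the review: a greedy CTC decoder that consumes per-frame scores chunk by chunk and emits a partial transcript after each chunk. The function name `streaming_ctc_greedy`, the blank index 0, and the assumption that a causal acoustic model has already produced the frame scores are all illustrative assumptions.

```python
from typing import Iterable, Iterator, List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (an illustrative choice)


def streaming_ctc_greedy(
    chunks: Iterable[Sequence[Sequence[float]]],
    vocab: Sequence[str],
) -> Iterator[str]:
    """Yield the partial transcript after each chunk of frame-level scores.

    Each chunk is a list of frames; each frame is a list of scores,
    one per vocabulary entry (hypothetical output of an acoustic model).
    """
    prev = BLANK  # label of the last frame seen, carried across chunk boundaries
    tokens: List[str] = []
    for chunk in chunks:
        for frame in chunk:
            # Greedy choice: the highest-scoring label for this frame.
            best = max(range(len(frame)), key=frame.__getitem__)
            # CTC collapse rule: skip blanks and repeats of the previous frame.
            if best != BLANK and best != prev:
                tokens.append(vocab[best])
            prev = best
        # A partial result is available before the rest of the audio arrives.
        yield "".join(tokens)
```

Because the only state carried between chunks is the previous frame's label and the tokens emitted so far, the decoder never needs to look ahead, which is what makes this form of decoding usable in a streaming setting. Recognition quality and delay in practice depend on the acoustic model and chunk size, which this sketch does not cover.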
