Computer Engineering and Applications, 2025, Vol. 61, Issue (6): 53-63. DOI: 10.3778/j.issn.1002-8331.2405-0145

• Research Hotspots and Reviews •

Review of Research on Fusion Technology of Speech Recognition and Large Language Models

WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen   

  1. College of Artificial Intelligence, Guangxi Minzu University, Nanning 530000, China
  • Online: 2025-03-15 Published: 2025-03-14

语音识别与大语言模型融合技术研究综述

王敬凯,秦董洪,白凤波,李路路,孔令儒,徐晨   

  1. 广西民族大学 人工智能学院,南宁 530000

Abstract: Large language models (LLMs) have emerged in rapid succession, driving development and innovation across many fields of artificial intelligence. Summarizing the positive effects of LLMs on speech recognition and exploring their prospects can provide new ideas for advancing the technology. Current mainstream end-to-end speech recognition models often use an additional language model to rescore recognition results, or combine one with weighted finite-state transducer (WFST) decoding, to improve recognition accuracy. Recent studies have found that integrating LLMs into the end-to-end training of speech recognition models can improve accuracy further. This review is organized around three methods of fusing speech recognition with language models, namely shallow fusion, deep fusion, and cold fusion, and analyzes their principles, advantages, and disadvantages. Recent experiments have confirmed that combining LLMs with acoustic models can effectively improve recognition accuracy. A systematic review of research progress on LLMs in automatic speech recognition (ASR) also shows that these models play an important role in the field. Techniques for integrating speech recognition with LLMs have gradually matured and merit further exploration and in-depth research.

Key words: speech recognition, large language model, deep learning
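
The rescoring step mentioned in the abstract can be illustrated with a minimal sketch of shallow-fusion-style N-best rescoring, in which each hypothesis is ranked by its ASR log-score plus a weighted language-model log-score. This is not the authors' implementation: the function names, the toy unigram table standing in for a real language model or LLM, the interpolation weight, and all scores are assumptions chosen for illustration.

```python
def toy_lm_logprob(text, unigram_logp, unk_logp=-10.0):
    """Hypothetical stand-in for a language model: sum of unigram log-probs."""
    return sum(unigram_logp.get(word, unk_logp) for word in text.split())


def rescore_nbest(nbest, lm_score, lm_weight=0.3):
    """Rank hypotheses by asr_logp + lm_weight * lm_logp (shallow-fusion-style).

    nbest is a list of (text, asr_logp) pairs; lm_score maps text to a log-prob.
    Returns (text, fused_score) pairs sorted from best to worst.
    """
    rescored = [(text, asr_logp + lm_weight * lm_score(text))
                for text, asr_logp in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Illustrative numbers only; they are not taken from the article.
UNIGRAM_LOGP = {"recognize": -3.0, "speech": -2.5, "wreck": -6.0,
                "a": -1.0, "nice": -4.0, "beach": -5.0}
NBEST = [("recognize speech", -4.2), ("wreck a nice beach", -4.0)]


def lm(text):
    return toy_lm_logprob(text, UNIGRAM_LOGP)


best_without_lm = rescore_nbest(NBEST, lm, lm_weight=0.0)[0][0]
best_with_lm = rescore_nbest(NBEST, lm, lm_weight=0.3)[0][0]
print(best_without_lm)
print(best_with_lm)
```

With the weight set to zero the ranking follows the ASR scores alone; with a positive weight the language-model score can promote a hypothesis the acoustic model ranked lower. In practice the weight would be tuned on held-out data.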

摘要: 在当今时代背景下,多种大语言模型层出不穷,推动了人工智能众多领域的发展和创新。归纳大语言模型在语音识别技术中的积极作用,并探讨其发展前景,可以为语音识别技术的发展提供创新思路。在目前主流的端到端语音识别模型中,常使用额外的语言模型对语音识别结果重打分或结合WFST算法辅助解码来提升语音识别结果的准确率。最新研究发现,将大型语言模型融入语音识别模型的端到端训练中,能够更好地提升语音识别结果的准确率。以浅融合、深度融合、冷融合三类语音识别与语言模型的融合方式为主线,进行了其原理及优劣的分析。近期研究者的实验结果证实,大语言模型与声学模型相结合能够有效提高识别精度。在系统地梳理了大语言模型在语音识别技术中的研究进展后,其在语音识别中的重要作用也得以揭示。语音识别与大语言模型融合的相关技术已经逐渐成熟,值得进一步的探索与深入研究。

关键词: 语音识别, 大语言模型, 深度学习