Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (6): 53-63. DOI: 10.3778/j.issn.1002-8331.2405-0145
• Research Hotspots and Reviews •

Review of Research on Fusion Technology of Speech Recognition and Large Language Models

WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen
Online: 2025-03-15
Published: 2025-03-14
WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen. Review of Research on Fusion Technology of Speech Recognition and Large Language Models[J]. Computer Engineering and Applications, 2025, 61(6): 53-63.
Related Articles

[1] WANG Wenju, TANG Bang, GU Zehua, WANG Sen. Overview of Multi-View 3D Reconstruction Techniques in Deep Learning[J]. Computer Engineering and Applications, 2025, 61(6): 22-35.
[2] SUN Yu, LIU Chuan, ZHOU Yang. Applications of Deep Learning in Knowledge Graph Construction and Reasoning[J]. Computer Engineering and Applications, 2025, 61(6): 36-52.
[3] TAO Jiangyao, XI Xuefeng, SHENG Shengli, CUI Zhiming, ZUO Yan. Review on Enhancing Reasoning Abilities of Large Language Model Through Structured Thinking Prompts[J]. Computer Engineering and Applications, 2025, 61(6): 64-83.
[4] HOU Ying, HU Xin, ZHAO Ruirui, ZHANG Nan, XU Yanhong, MA Li. Escalator Passenger Safety Detection YOLO_BFROI Algorithm Based on Region of Interest[J]. Computer Engineering and Applications, 2025, 61(6): 84-95.
[5] LI Jiajing, LI Sheng, DAI Yuanyuan, MENG Tao, LUO Xiaoqing, YAN Hongfei. Aspect-Level Sentiment Analysis Incorporating Location Information and Interaction Attention[J]. Computer Engineering and Applications, 2025, 61(6): 220-228.
[6] LIU Hongyu, GAO Jian. Research on Detection and Classification Model of Illegal and Criminal Android Malware Integrating CBAM[J]. Computer Engineering and Applications, 2025, 61(6): 317-327.
[7] HONG Shuying, ZHANG Donglin. Survey on Lane Line Detection Techniques for Classifying Semantic Information Processing Modalities[J]. Computer Engineering and Applications, 2025, 61(5): 1-17.
[8] ZHANG Jianwei, CHEN Xu, WANG Shuyang, JING Yongjun, SONG Jifei. Review of Application of Spatiotemporal Graph Neural Networks in Internet of Things[J]. Computer Engineering and Applications, 2025, 61(5): 43-54.
[9] JIANG Shuangwu, ZHANG Jiawei, HUA Liansheng, YANG Jinglin. Implementation of Meteorological Database Question-Answering Based on Large-Scale Model Retrieval-Augmentation Generation[J]. Computer Engineering and Applications, 2025, 61(5): 113-121.
[10] YU Chengxu, ZHANG Yulai. Research on Deep Learning Backdoor Defense Based on Fine-Tuning[J]. Computer Engineering and Applications, 2025, 61(5): 155-164.
[11] LI Xiaotong, MA Sufen, SHENG Hui, WEI Guohui, LI Xintong. Review of Lung CT Image Lesion Region Segmentation Based on Deep Learning[J]. Computer Engineering and Applications, 2025, 61(4): 25-42.
[12] XU Chundong, WU Ziyu, GE Fengpei. Review of Speech Recognition Techniques for Low Data Resources[J]. Computer Engineering and Applications, 2025, 61(4): 59-71.
[13] DONG Jiadong, GUO Qinghu, CHEN Lin, SANG Feihu. Review on Optimization Algorithms for One-Stage Metal Surface Defect Detection in Deep Learning[J]. Computer Engineering and Applications, 2025, 61(4): 72-89.
[14] YUAN Zhongxu, LI Li, HE Fan, YANG Xiu, HAN Dongxuan. Traditional Chinese Medicine Question Answering Model Based on Chain-of-Thought and Knowledge Graph[J]. Computer Engineering and Applications, 2025, 61(4): 158-166.
[15] LI Yue, HONG Hailan, LI Wenlin, YANG Tao. Study on Application of Large Language Model in Constructing Knowledge Graph of Medical Cases of Rhinitis[J]. Computer Engineering and Applications, 2025, 61(4): 167-175.