Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (6): 53-63. DOI: 10.3778/j.issn.1002-8331.2405-0145
WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen
Online: 2025-03-15
Published: 2025-03-14
Abstract: A wide variety of large language models have emerged in recent years, driving development and innovation across many areas of artificial intelligence. Summarizing the positive role that large language models play in speech recognition, and examining their prospects, can offer fresh ideas for advancing speech recognition technology. In today's mainstream end-to-end speech recognition models, accuracy is commonly improved by using an external language model to rescore recognition hypotheses, or by incorporating WFST-based algorithms to assist decoding. Recent studies have found that integrating large language models into the end-to-end training of speech recognition models improves recognition accuracy further still. Taking the three ways of fusing speech recognition with language models, namely shallow fusion, deep fusion, and cold fusion, as its main thread, this survey analyzes their principles and respective strengths and weaknesses. Experimental results reported in recent work confirm that combining large language models with acoustic models effectively improves recognition accuracy. A systematic review of research progress on large language models in speech recognition reveals the important role they play in the field. Techniques for fusing speech recognition with large language models are gradually maturing and merit further exploration and in-depth study.
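The rescoring and shallow-fusion ideas the abstract mentions both reduce to one scoring rule: interpolate the acoustic model's log-probability for a hypothesis with a language model's log-probability, weighted by a tunable coefficient. The sketch below illustrates this as an n-best rescoring step; the function name, the toy hypotheses, and the weight value are illustrative assumptions, not the setup of any specific paper surveyed here.

```python
def shallow_fusion_rescore(hypotheses, lm_score_fn, lm_weight=0.3):
    """Rerank ASR n-best hypotheses by shallow fusion.

    hypotheses  -- list of (text, acoustic_log_prob) pairs from the ASR decoder
    lm_score_fn -- callable returning the language model's log-probability of a text
    lm_weight   -- interpolation weight for the LM score (tuned on a dev set)
    """
    scored = []
    for text, am_logprob in hypotheses:
        # Fused score: log P_AM(y|x) + lambda * log P_LM(y)
        fused = am_logprob + lm_weight * lm_score_fn(text)
        scored.append((fused, text))
    # Higher (less negative) fused log-probability ranks first
    scored.sort(reverse=True)
    return [text for _, text in scored]


# Toy example: a dictionary stands in for a real language model's scores.
lm_scores = {"i scream": -6.0, "ice cream": -1.0}
nbest = [("i scream", -2.0), ("ice cream", -2.5)]
ranked = shallow_fusion_rescore(nbest, lambda t: lm_scores[t], lm_weight=0.3)
# The LM's preference for "ice cream" overturns the acoustic ranking.
```

In a real system `lm_score_fn` would query a neural LM (or a large language model) over the candidate transcript, and the interpolation happens inside beam search at each decoding step rather than over a finished n-best list; the scoring arithmetic is the same in both cases.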
WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen. Review of Research on Fusion Technology of Speech Recognition and Large Language Models[J]. Computer Engineering and Applications, 2025, 61(6): 53-63.