Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (6): 53-63. DOI: 10.3778/j.issn.1002-8331.2405-0145
• Research Hotspots and Reviews •

Review of Research on Fusion Technology of Speech Recognition and Large Language Models

WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen
Online: 2025-03-15
Published: 2025-03-14
WANG Jingkai, QIN Donghong, BAI Fengbo, LI Lulu, KONG Lingru, XU Chen. Review of Research on Fusion Technology of Speech Recognition and Large Language Models[J]. Computer Engineering and Applications, 2025, 61(6): 53-63.
Related Articles

[1] WANG Wenju, TANG Bang, GU Zehua, WANG Sen. Overview of Multi-View 3D Reconstruction Techniques in Deep Learning[J]. Computer Engineering and Applications, 2025, 61(6): 22-35.
[2] SUN Yu, LIU Chuan, ZHOU Yang. Applications of Deep Learning in Knowledge Graph Construction and Reasoning[J]. Computer Engineering and Applications, 2025, 61(6): 36-52.
[3] TAO Jiangyao, XI Xuefeng, SHENG Shengli, CUI Zhiming, ZUO Yan. Review on Enhancing Reasoning Abilities of Large Language Model Through Structured Thinking Prompts[J]. Computer Engineering and Applications, 2025, 61(6): 64-83.
[4] HOU Ying, HU Xin, ZHAO Ruirui, ZHANG Nan, XU Yanhong, MA Li. Escalator Passenger Safety Detection YOLO_BFROI Algorithm Based on Region of Interest[J]. Computer Engineering and Applications, 2025, 61(6): 84-95.
[5] LI Jiajing, LI Sheng, DAI Yuanyuan, MENG Tao, LUO Xiaoqing, YAN Hongfei. Aspect-Level Sentiment Analysis Incorporating Location Information and Interaction Attention[J]. Computer Engineering and Applications, 2025, 61(6): 220-228.
[6] LIU Hongyu, GAO Jian. Research on Detection and Classification Model of Illegal and Criminal Android Malware Integrating CBAM[J]. Computer Engineering and Applications, 2025, 61(6): 317-327.
[7] HONG Shuying, ZHANG Donglin. Survey on Lane Line Detection Techniques for Classifying Semantic Information Processing Modalities[J]. Computer Engineering and Applications, 2025, 61(5): 1-17.
[8] ZHANG Jianwei, CHEN Xu, WANG Shuyang, JING Yongjun, SONG Jifei. Review of Application of Spatiotemporal Graph Neural Networks in Internet of Things[J]. Computer Engineering and Applications, 2025, 61(5): 43-54.
[9] JIANG Shuangwu, ZHANG Jiawei, HUA Liansheng, YANG Jinglin. Implementation of Meteorological Database Question-Answering Based on Large-Scale Model Retrieval-Augmentation Generation[J]. Computer Engineering and Applications, 2025, 61(5): 113-121.
[10] YU Chengxu, ZHANG Yulai. Research on Deep Learning Backdoor Defense Based on Fine-Tuning[J]. Computer Engineering and Applications, 2025, 61(5): 155-164.
[11] LI Xiaotong, MA Sufen, SHENG Hui, WEI Guohui, LI Xintong. Review of Lung CT Image Lesion Region Segmentation Based on Deep Learning[J]. Computer Engineering and Applications, 2025, 61(4): 25-42.
[12] XU Chundong, WU Ziyu, GE Fengpei. Review of Speech Recognition Techniques for Low Data Resources[J]. Computer Engineering and Applications, 2025, 61(4): 59-71.
[13] DONG Jiadong, GUO Qinghu, CHEN Lin, SANG Feihu. Review on Optimization Algorithms for One-Stage Metal Surface Defect Detection in Deep Learning[J]. Computer Engineering and Applications, 2025, 61(4): 72-89.
[14] YUAN Zhongxu, LI Li, HE Fan, YANG Xiu, HAN Dongxuan. Traditional Chinese Medicine Question Answering Model Based on Chain-of-Thought and Knowledge Graph[J]. Computer Engineering and Applications, 2025, 61(4): 158-166.
[15] LI Yue, HONG Hailan, LI Wenlin, YANG Tao. Study on Application of Large Language Model in Constructing Knowledge Graph of Medical Cases of Rhinitis[J]. Computer Engineering and Applications, 2025, 61(4): 167-175.