
Computer Engineering and Applications, 2025, Vol. 61, Issue (4): 59-71. DOI: 10.3778/j.issn.1002-8331.2405-0425
Review of Speech Recognition Techniques for Low Data Resources
XU Chundong, WU Ziyu, GE Fengpei
Online: 2025-02-15
Published: 2025-02-14
摘要: 近年来,自动语音识别的研究重心由传统识别方法转向基于深度学习的语音识别方法。“大模型”现象反映出深度学习方法的性能随着训练数据量的增加呈现显著上升的趋势。然而,现实环境的复杂性、语音数据分布的非均匀性和用户隐私的保护等因素给数据的收集造成困难。同时,语音数据的标注需要大量专业人员的参与,导致标注成本很高。因此,语音识别在实际应用中经常面临数据资源不足的问题。在这种低数据资源条件下构建性能优异且稳定的语音识别系统仍是研究难点。简单归纳了语音识别的发展历程,总结了语音识别的基本框架以及常见的国内外开源数据集。围绕低数据资源问题,详细分析了低数据资源的判定方法,继而梳理了四类技术方案,包括数据增强、联邦学习、自监督学习以及元学习,并对它们的性能状况以及优缺点进行了系统的剖析。最后讨论了该研究方向未来潜在的发展趋势和可能面临的问题。
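As a minimal illustration of the first of the four surveyed families, the NumPy sketch below applies SpecAugment-style frequency and time masking to a log-mel spectrogram. This is a simplified example of the general technique, not the implementation evaluated in the surveyed work; the function name spec_augment, the default mask counts and widths, and the feature shape are assumptions chosen for the demonstration.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, max_freq_width=10,
                 num_time_masks=2, max_time_width=20, rng=None):
    """SpecAugment-style masking on a (freq_bins, time_steps) log-mel spectrogram.

    A minimal sketch: zero out random frequency bands and time spans so the
    acoustic model cannot over-fit to any single band or segment.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape

    # Frequency masking: blank out random bands of mel bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, n_freq - width)))
        out[start:start + width, :] = 0.0

    # Time masking: blank out random spans of frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, n_time - width)))
        out[:, start:start + width] = 0.0

    return out

# Example: turn one 80-bin, 300-frame utterance into several training views.
mel = np.random.randn(80, 300)  # stand-in for a real log-mel feature matrix
augmented_views = [spec_augment(mel) for _ in range(4)]
```

Because the masking operates on features rather than raw waveforms, each training epoch can present a differently corrupted view of the same utterance, which is one reason this family of methods is attractive when labeled speech is scarce.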
XU Chundong, WU Ziyu, GE Fengpei. Review of Speech Recognition Techniques for Low Data Resources[J]. Computer Engineering and Applications, 2025, 61(4): 59-71.