Survey of Specific Speech Recognition Algorithms for Dysarthria

doi:10.3778/j.issn.1002-8331.2309-0154

Abstract

Dysarthria is a difficult medical condition to which current mainstream speech recognition technologies are poorly adapted. Speech recognition technology designed for dysarthria combines pre-training with personalized training and uses data-driven methods to further improve algorithm performance and reduce the word error rate. However, it is still some distance from practical commercial use, and its development is limited by data scale and by the available techniques. To date, no comprehensive survey of speech recognition for dysarthria has been published. A comparative analysis of the dataset construction methods and advanced techniques in this field is urgently needed so that researchers entering the field can get up to speed quickly. This paper surveys existing datasets, mainstream algorithms, and evaluation methods, and summarizes the scale, form, and characteristics of the mainstream domestic and international dysarthric speech datasets. It analyzes the mainstream algorithms for dysarthric speech recognition and reports the performance and characteristics of each. Finally, it examines performance evaluation metrics for algorithm models based on the severity level of patients with dysarthria and discusses future research directions, with the aim of helping researchers working on dysarthric speech recognition and supporting the rapid development of the field.

Key words: dysarthria, speech recognition, deep learning, artificial intelligence


SONG Wei, ZHANG Yanghao. Survey of Specific Speech Recognition Algorithms for Dysarthria[J]. Computer Engineering and Applications, 2024, 60(11): 62-74.

