Multi-task Learning for Speech Enhancement and Detection

doi:10.3778/j.issn.1002-8331.2006-0283

Abstract

Many practical speech signal processing applications require a system to handle multiple tasks in real time with low latency while remaining highly robust to noise. To address this, a multi-task deep learning model for speech enhancement and Voice Activity Detection (VAD) is proposed. By introducing a Long Short-Term Memory (LSTM) network, the model forms a causal system suited to real-time online processing. Exploiting the strong correlation between speech enhancement and VAD, the model connects the output layers of the two tasks through hard parameter sharing, which reduces the computational cost and improves the generalization ability of the tasks through multi-task learning. Experimental results show that, compared with a baseline that processes the two tasks serially, the multi-task model is 44.2% faster while achieving very similar speech enhancement results and better VAD results, which is of great significance for the practical application and deployment of deep learning models.

Key words: multi-task learning, deep learning, speech enhancement, voice activity detection
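The hard parameter sharing described in the abstract can be illustrated with a minimal sketch: one LSTM trunk is run once per frame, and two small output layers read the same hidden state. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class name `SharedLSTMMultiTask`, the layer sizes, the random untrained weights, the omission of bias terms, and the choice of a per-bin sigmoid mask as the enhancement output. The sketch uses only the Python standard library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedLSTMMultiTask:
    """Hypothetical shared LSTM trunk with two task-specific output layers."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # One weight matrix per gate, acting on [input; previous hidden state].
        # Biases are omitted to keep the sketch short.
        self.gates = {g: rand_matrix(n_hidden, n_in + n_hidden, rng) for g in "ifog"}
        self.W_mask = rand_matrix(n_in, n_hidden, rng)  # enhancement head (assumed mask output)
        self.W_vad = rand_matrix(1, n_hidden, rng)      # VAD head (speech probability)
        self.reset()

    def reset(self):
        self.h = [0.0] * self.n_hidden
        self.c = [0.0] * self.n_hidden

    def step(self, frame):
        """Process one frame using only the current input and past state (causal)."""
        z = list(frame) + self.h
        i = [sigmoid(a) for a in matvec(self.gates["i"], z)]
        f = [sigmoid(a) for a in matvec(self.gates["f"], z)]
        o = [sigmoid(a) for a in matvec(self.gates["o"], z)]
        g = [math.tanh(a) for a in matvec(self.gates["g"], z)]
        self.c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, self.c, i, g)]
        self.h = [oi * math.tanh(ci) for oi, ci in zip(o, self.c)]
        # Both heads read the same hidden state, so the trunk runs once per frame.
        mask = [sigmoid(a) for a in matvec(self.W_mask, self.h)]
        vad = sigmoid(matvec(self.W_vad, self.h)[0])
        return mask, vad


# Toy usage: two 4-bin feature frames streamed through the untrained model.
model = SharedLSTMMultiTask(n_in=4, n_hidden=8)
frames = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
outputs = [model.step(frame) for frame in frames]
```

Because `step` depends only on the current frame and the stored state, the output for a frame never changes when later frames arrive, which is the property that makes online processing possible. A serial baseline would run two separate trunks per frame, so sharing the trunk is where a saving in computation would come from in a design like this.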

WANG Shiqi, ZENG Qingning, LONG Chao, XIONG Songling, QI Xiaoxiao. Multi-task Learning for Speech Enhancement and Detection[J]. Computer Engineering and Applications, 2021, 57(20): 197-202.
