Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (12): 25-36. DOI: 10.3778/j.issn.1002-8331.2201-0371

• Hot Topics and Reviews •


Research on Transformer-Based Single-Channel Speech Enhancement

FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan   

  1. College of Command and Control Engineering, Army Engineering University, Nanjing 210007, China
    2. Department of Test Control, High-Tech Institute, Weifang, Shandong 262500, China
  • Online:2022-06-15 Published:2022-06-15


Abstract: Deep learning can effectively model the complex mapping between noisy and clean speech signals and thereby improve the quality of single-channel speech enhancement, but the quality of the enhanced speech is still not ideal. The Transformer has been widely used in speech signal processing: because it integrates a multi-head attention mechanism, it can better attend to the long-term correlations in speech and thus further improve enhancement performance. On this basis, this paper reviews deep learning-based speech enhancement models, summarizes the Transformer model and its internal structure, classifies Transformer-based speech enhancement models according to their implementation structures, and analyzes several representative models in detail. Furthermore, the performance of Transformer-based single-channel speech enhancement is compared on commonly used public datasets, and the advantages and disadvantages of the models are analyzed. Finally, the shortcomings of existing work are summarized and future directions are discussed.
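The multi-head attention mechanism the abstract refers to lets every speech frame attend to every other frame, which is how long-term correlations are captured. A minimal NumPy sketch of scaled dot-product multi-head self-attention over a sequence of speech feature frames is shown below; the random projection weights, dimensions, and function name are illustrative assumptions, not the surveyed models' actual parameters:

```python
import numpy as np

def multi_head_attention(x, num_heads, rng):
    """Minimal multi-head self-attention over a feature sequence.

    x: (seq_len, d_model) array, e.g. frames of a noisy spectrogram.
    Projection weights are random here purely for illustration;
    in a real Transformer they are learned.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Random matrices stand in for the learned Q/K/V/output projections.
    w_q = rng.standard_normal((d_model, d_model))
    w_k = rng.standard_normal((d_model, d_model))
    w_v = rng.standard_normal((d_model, d_model))
    w_o = rng.standard_normal((d_model, d_model))

    # Project, then split the model dimension into `num_heads` heads:
    # shapes become (num_heads, seq_len, d_head).
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention: each frame scores its similarity
    # to every other frame, so distant context contributes directly.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values, heads concatenated back to d_model.
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 64))  # 100 time frames, 64-dim features
enhanced = multi_head_attention(frames, num_heads=8, rng=rng)
print(enhanced.shape)  # (100, 64)
```

Note that each head attends over the full sequence in parallel, which is why the attention weight matrix is `seq_len × seq_len` per head; the surveyed enhancement models differ mainly in where such blocks sit in the network and what features they operate on.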

Key words: speech enhancement, deep learning, Transformer, single-channel, multi-head attention mechanism