3D Speech Enhancement Algorithm Based on a Two-Stage U-Net Beamforming Network

doi:10.3778/j.issn.1002-8331.2207-0352

Abstract

Abstract: Noise in 3D reverberant environments degrades many downstream applications, so developing 3D speech enhancement techniques suited to near-realistic scenes has both theoretical significance and practical value. This paper proposes a two-stage beamforming network for 3D speech enhancement in this scenario. The network consists of two consecutive multiple-input single-output (MISO) U-Net beamforming networks. The first-stage network performs a coarse beamforming estimate on the 3D speech signals from the two microphones and filters out part of the noise. To refine this estimate, the second-stage network takes the features of the coarse estimate together with the features of the omnidirectional channel of the original signal as input and performs a fine beamforming estimate, yielding a more accurate signal and achieving two-stage enhancement. The dataset and experiments are based on the 3D speech enhancement task of the L3DAS22 challenge. On the blind test set, the method obtains a short-time objective intelligibility (STOI) of 0.925 and a word error rate (WER) of 13.6%, clearly better than the winning model of the L3DAS21 3D speech enhancement challenge (0.878 and 21.2%).

Key words: speech enhancement, 3D speech signal, deep learning, beamforming
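The data flow described in the abstract can be sketched as follows. This is only an illustration of how the two stages are chained, not the authors' implementation: the function `miso_stage` is a hypothetical stand-in (plain channel averaging) for a trained MISO U-Net beamforming network, the sketch works on raw waveforms rather than learned features, and the layout of four channels per microphone with the omnidirectional channel at index 0 is an assumption based on the first-order Ambisonics format.

```python
import numpy as np


def miso_stage(x: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for one MISO U-Net beamforming network.

    Maps a (channels, samples) array to a single-channel (samples,) array.
    Channel averaging is used here only so the sketch runs; a real stage
    would be a trained network.
    """
    return x.mean(axis=0)


def two_stage_enhance(mic_a: np.ndarray, mic_b: np.ndarray) -> np.ndarray:
    """Chain two MISO stages as outlined in the abstract.

    mic_a, mic_b: arrays of shape (4, samples); channel 0 is assumed to be
    the omnidirectional channel.
    """
    # Stage 1: coarse estimate from all channels of both microphones.
    mixture = np.concatenate([mic_a, mic_b], axis=0)        # (8, samples)
    coarse = miso_stage(mixture)                            # (samples,)

    # Stage 2: coarse estimate plus the omnidirectional channels of the
    # original signal are fed to the second MISO stage for refinement.
    omni = np.stack([mic_a[0], mic_b[0]], axis=0)           # (2, samples)
    stage2_input = np.concatenate([coarse[None, :], omni], axis=0)  # (3, samples)
    return miso_stage(stage2_input)                         # (samples,)


rng = np.random.default_rng(0)
num_samples = 16000  # arbitrary length chosen for the example
mic_a = rng.standard_normal((4, num_samples))
mic_b = rng.standard_normal((4, num_samples))
enhanced = two_stage_enhance(mic_a, mic_b)
print(enhanced.shape)
```

Replacing `miso_stage` with two separately trained networks, one per stage, would keep the same input and output shapes while providing the actual enhancement.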

LIN Wenmo, CHEN Feilong, SUN Chengli, ZHU Zhenjun. 3D Speech Enhancement Algorithm for Two-Stage U-Net Beamforming Network[J]. Computer Engineering and Applications, 2023, 59(22): 128-135.
