Computer Engineering and Applications, 2023, Vol. 59, Issue 22: 128-135. DOI: 10.3778/j.issn.1002-8331.2207-0352

• Pattern Recognition and Artificial Intelligence •

3D Speech Enhancement Algorithm Using a Two-Stage U-Net Beamforming Network

林文模,陈飞龙,孙成立,朱祯君   

  1. School of Information Engineering, Nanchang Hangkong University, Nanchang 330063
  • Published: 2023-11-15 Online: 2023-11-15

3D Speech Enhancement Algorithm for Two-Stage U-Net Beamforming Network

LIN Wenmo, CHEN Feilong, SUN Chengli, ZHU Zhenjun   

  1. School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
  • Online: 2023-11-15 Published: 2023-11-15

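Abstract: Noise in 3D reverberant environments is detrimental to many downstream applications, so developing 3D speech enhancement techniques suited to near-realistic scenarios has both theoretical significance and practical value. For this scenario, a two-stage beamforming network for 3D speech enhancement is proposed. The network consists of two cascaded multiple-input single-output U-Net beamforming networks. The first stage performs a coarse beamforming estimate on the 3D speech signals from the two microphones, filtering out part of the noise. To refine this estimate, the second stage takes the features of the coarse estimate together with the features of the omnidirectional channels of the original signal as input and performs a fine beamforming estimate, yielding a more accurate estimated signal and thus two-stage enhancement. The dataset and experiments are based on the 3D speech enhancement task of the L3DAS22 challenge. On the blind test set, the method achieves a short-time objective intelligibility (STOI) of 0.925 and a word error rate (WER) of 13.6%, clearly outperforming the champion model of the L3DAS21 3D speech enhancement challenge (0.878 and 21.2%).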

Keywords: speech enhancement, 3D speech signal, deep learning, beamforming

Abstract: Noise in 3D reverberant environments is detrimental to many downstream applications. Developing 3D speech enhancement technology suited to realistic scenes therefore has both theoretical significance and practical value. This paper proposes a two-stage beamforming network for 3D speech enhancement in this scenario. The network consists of two consecutive multiple-input single-output U-Net beamforming networks. The first-stage network performs a coarse beamforming estimate on the 3D speech signal from the dual microphones and filters out part of the noise. To further improve the estimate, the second-stage network takes the features of the coarse estimate, together with the features of the omnidirectional channel information in the original signal, as input and performs a fine beamforming estimate to obtain a more accurate signal, achieving two-stage enhancement. The dataset and experiments are based on the 3D speech enhancement task of the L3DAS22 challenge. On the blind test set, the method obtains a short-time objective intelligibility (STOI) of 0.925 and a word error rate (WER) of 13.6%, significantly better than the champion model of the L3DAS21 3D speech enhancement challenge (0.878 and 21.2%).
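
The two-stage data flow described in the abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: it assumes each stage predicts one complex weight per channel and time-frequency bin and that beamforming is a filter-and-sum over channels. The names `filter_and_sum`, `two_stage_enhance`, and `uniform_weights`, as well as the choice of channels 0 and 4 as the omnidirectional channels, are hypothetical. The trained U-Nets are replaced by a placeholder that simply averages its inputs.

```python
from typing import Callable, List, Sequence

Spectrogram = List[List[complex]]   # indexed [frame][freq_bin]
MultiChannel = List[Spectrogram]    # indexed [channel][frame][freq_bin]
# Stand-in type for a U-Net that maps input spectrograms to per-channel weights.
WeightEstimator = Callable[[MultiChannel], MultiChannel]


def filter_and_sum(weights: MultiChannel, inputs: MultiChannel) -> Spectrogram:
    """Multiple-input single-output step: weight each channel per bin, then sum."""
    n_frames, n_bins = len(inputs[0]), len(inputs[0][0])
    return [
        [sum(w[t][f] * x[t][f] for w, x in zip(weights, inputs))
         for f in range(n_bins)]
        for t in range(n_frames)
    ]


def two_stage_enhance(
    mic_channels: MultiChannel,
    omni_indices: Sequence[int],
    stage1: WeightEstimator,
    stage2: WeightEstimator,
) -> Spectrogram:
    # Stage 1: coarse estimate from all microphone channels.
    coarse = filter_and_sum(stage1(mic_channels), mic_channels)
    # Stage 2: refine using the coarse estimate plus the omnidirectional channels.
    stage2_in = [coarse] + [mic_channels[i] for i in omni_indices]
    return filter_and_sum(stage2(stage2_in), stage2_in)


def uniform_weights(inputs: MultiChannel) -> MultiChannel:
    """Placeholder for a trained network: equal real weights (channel averaging)."""
    n = len(inputs)
    return [[[1.0 / n for _ in frame] for frame in channel] for channel in inputs]


if __name__ == "__main__":
    # Assumed layout: 2 microphones x 4 channels, 2 frames, 3 frequency bins.
    x = [[[complex(1, 1)] * 3 for _ in range(2)] for _ in range(8)]
    y = two_stage_enhance(x, omni_indices=[0, 4],
                          stage1=uniform_weights, stage2=uniform_weights)
    print(len(y), len(y[0]))
```

In a real system, `stage1` and `stage2` would be learned networks operating on short-time Fourier transform features, and the output spectrogram would be converted back to a waveform.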

Key words: speech enhancement, 3D speech signal, deep learning, beamforming
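
For readers unfamiliar with the WER figure quoted in the abstract, the sketch below shows the usual definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the speech recognizer and text normalization used in the challenge evaluation are not described here, so this only illustrates the metric itself.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower WER is better, which is why 13.6% compares favorably with 21.2%.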