Computer Engineering and Applications, 2022, Vol. 58, Issue (12): 25-36. DOI: 10.3778/j.issn.1002-8331.2201-0371
• Research Hotspots and Reviews •
FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan
Online: 2022-06-15
Published: 2022-06-15
FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan. Research on Transformer-Based Single-Channel Speech Enhancement[J]. Computer Engineering and Applications, 2022, 58(12): 25-36.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2201-0371