Survey on Adversarial Example Attack and Defense Technology for Automatic Speech Recognition

doi:10.3778/j.issn.1002-8331.2202-0196

Abstract

Abstract: Speech recognition is an important mode of human-computer interaction. With the continued development of deep learning, automatic speech recognition (ASR) systems based on deep learning have made significant progress. However, carefully crafted audio adversarial examples can cause neural-network-based ASR systems to produce incorrect output, which creates security risks for the applications built on them. Improving the security of neural-network-based ASR systems therefore requires studying both attacks that use audio adversarial examples and defenses against them. To that end, this survey first analyzes and summarizes the state of research on adversarial example generation and defense techniques. It then discusses the challenges facing adversarial attack and defense for ASR systems, together with possible solutions.

Key words: automatic speech recognition, deep learning, adversarial attack, adversarial defense

LI Kezi, XU Yang, ZHANG Sicong, YAN Jiale. Survey on Adversarial Example Attack and Defense Technology for Automatic Speech Recognition[J]. Computer Engineering and Applications, 2022, 58(14): 1-15.
