Space-Time Gradient Iterative Voiceprint Adversarial Attack Algorithm STI-FGSM

doi:10.3778/j.issn.1002-8331.2207-0476

Abstract

To address the insufficient use of gradient information and the poor transferability of current voiceprint adversarial attack algorithms, this paper proposes a space-time iterative fast gradient sign method (STI-FGSM) for attacking speaker recognition models. Building on the momentum iterative fast gradient sign method (MI-FGSM), the algorithm first fuses momentum with temporal gradient information, using the gradient observed at the next step to correct the perturbation update direction. It then introduces spatial gradient information to learn the regional information of the speech sample and accumulate spatial gradient momentum across different regions. Finally, it applies a perturbation ensemble method that exploits the known white-box models, superimposing the perturbations from multiple models to further raise the black-box attack success rate. Experimental results show that, against four speaker recognition models (ResNetSE34V2, TDy_ResNet34_half, x-vector, and ECAPA-TDNN), STI-FGSM achieves strong white-box attacks and high black-box attack success rates, outperforming the other algorithms compared.
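
The temporal part of the method, a momentum update corrected by the gradient observed at the next step, can be illustrated with a minimal sketch. This is not the authors' STI-FGSM: it assumes a Nesterov-style look-ahead step added to MI-FGSM, and it omits the spatial gradient accumulation and the multi-model perturbation ensemble. The function name `lookahead_mi_fgsm`, the callable `grad_fn`, the parameter values, and the linear toy loss standing in for a speaker model are all hypothetical.

```python
import numpy as np


def lookahead_mi_fgsm(x, grad_fn, eps=0.01, steps=10, mu=1.0):
    """Momentum iterative FGSM with a look-ahead (Nesterov-style) gradient.

    x       : clean waveform, 1-D float array (hypothetical input)
    grad_fn : callable returning d(loss)/d(input) at a given point
    eps     : L-infinity budget for the perturbation
    steps   : number of iterations
    mu      : momentum decay factor
    """
    alpha = eps / steps               # per-iteration step size
    g = np.zeros_like(x)              # accumulated momentum
    x_adv = x.copy()
    for _ in range(steps):
        # Evaluate the gradient at the point the momentum is about to reach.
        x_look = x_adv + alpha * mu * g
        grad = grad_fn(x_look)
        # L1-normalise the gradient before adding it to the momentum.
        g = mu * g + grad / (np.sum(np.abs(grad)) + 1e-12)
        # Take a signed step, then project back into the eps-ball around x.
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv


# Toy stand-in for a speaker model (assumption): loss(x) = w . x,
# so the gradient with respect to the input is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=16000)
x = rng.normal(scale=0.1, size=16000)
x_adv = lookahead_mi_fgsm(x, lambda z: w, eps=0.002, steps=10)
```

In a real attack, `grad_fn` would backpropagate a speaker recognition loss to the input waveform. A perturbation ensemble over several white-box models could then be approximated by combining the per-model gradients or perturbations, though how STI-FGSM weights them is not specified in the abstract.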

Key words: speaker recognition, adversarial attack, gradient, perturbation ensemble, white-box attack, black-box attack, transferability

LI Shuo, GU Yijun, TAN Hao. Space-Time Gradient Iterative Voiceprint Adversarial Attack Algorithm STI-FGSM[J]. Computer Engineering and Applications, 2023, 59(21): 151-158.