Speaker Verification in Deliberately Disguised Scenarios

doi:10.3778/j.issn.1002-8331.2408-0403

Abstract

Abstract: Deliberately disguised speaker verification is difficult because speakers intentionally alter their timbre to sound like someone else and thereby conceal their identity. The task is treated here as a scenario in which one person plays multiple roles, and the CN-Movies training set and the TheSound-test test set are proposed for it. CN-Movies is built from Chinese movies by applying character matching, face detection, face recognition, lip movement recognition, and voice activity detection to actors and voice actors, yielding speech segments of multiple roles played by the same person across several productions. The dataset contains the actors' original voices together with those of their corresponding voice actors. Because actors and voice actors intentionally change their timbre to portray a role, the dataset provides one-person, multiple-role data for deliberate disguise. In addition, the program TheSound, in which voice actors deliberately hide their identities to avoid being recognized, is used to build TheSound-test, a test set for deliberate disguise scenarios. By combining the data mined from these domains and modeling with a siamese network, significant improvements in speaker verification performance are achieved on both the VoxMovies test set and TheSound-test.

Key words: speaker verification, deliberate disguise, siamese network
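
The siamese modeling mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the encoder below is a hypothetical fixed random projection standing in for a trained speaker-embedding network, and the names `embed`, `cosine_score`, `verify`, the feature sizes, and the threshold are all assumptions made for illustration. The sketch shows only the core idea that both utterances pass through the same weights and the pair is scored by cosine similarity.

```python
import numpy as np

FEAT_DIM, EMB_DIM = 40, 16  # assumed sizes, chosen only for this sketch

# Hypothetical shared weights; a real system would use a trained network.
_rng = np.random.default_rng(0)
W = _rng.standard_normal((FEAT_DIM, EMB_DIM))


def embed(frames: np.ndarray) -> np.ndarray:
    """Map (num_frames, FEAT_DIM) acoustic features to a unit-length embedding."""
    pooled = frames.mean(axis=0)      # average over time
    emb = np.tanh(pooled @ W)         # the same weights W serve both branches
    return emb / (np.linalg.norm(emb) + 1e-12)


def cosine_score(utt_a: np.ndarray, utt_b: np.ndarray) -> float:
    """Cosine similarity between the embeddings of two utterances."""
    return float(embed(utt_a) @ embed(utt_b))


def verify(utt_a: np.ndarray, utt_b: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the pair as the same speaker if the score reaches the threshold."""
    return cosine_score(utt_a, utt_b) >= threshold
```

In a trained system the threshold would be tuned on held-out trial pairs, including pairs in which one speaker plays different roles.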

QIN Xiaoyi, LI Ze, LIU Dong, LI Ming. Speaker Verification in Deliberately Disguised Scenarios[J]. Computer Engineering and Applications, 2025, 61(21): 324-332.
