Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (21): 324-332. DOI: 10.3778/j.issn.1002-8331.2408-0403

• Engineering and Applications •

刻意伪装场景下的说话人确认

覃晓逸,励泽,刘东,李明   

Speaker Verification in Deliberately Disguised Scenarios

QIN Xiaoyi, LI Ze, LIU Dong, LI Ming

  1. School of Computer Science, Wuhan University, Wuhan 430072, China
  2. Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Duke Kunshan University, Suzhou, Jiangsu 215316, China
  • Published: 2025-11-01 Released online: 2025-10-31

Abstract: Speaker verification under deliberate disguise is difficult because speakers intentionally alter their timbre to sound like someone else and thereby conceal their identity. This paper treats the task as a scenario in which one person plays multiple roles, and proposes the CN-Movies training set and the TheSound-test test set for it. CN-Movies is built from Chinese movies by applying character matching, face detection, face recognition, lip movement recognition, and voice activity detection to actors and voice actors, yielding speech segments of the multiple roles that one person plays across several productions. The dataset contains both the actors' original voices and the voices of the corresponding voice actors. Because actors and voice actors intentionally change their timbre to portray a character, this makes it possible to collect one-person, multi-role data for deliberate disguise. In addition, the TheSound-test test set for deliberate disguise scenarios is built from the program TheSound, in which voice actors deliberately hide their identities to avoid being recognized. Modeling with a siamese network on the combined in-domain data mined above yields significant improvements in speaker verification performance on both the VoxMovies test set and TheSound-test.

Key words: speaker verification, deliberate disguise, siamese network