Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (7): 147-156. DOI: 10.3778/j.issn.1002-8331.2210-0145

• Pattern Recognition and Artificial Intelligence •

Conformer-Based Speaker Recognition Model for Real-Time Multi-Scenarios

XUAN Xi (宣茜), HAN Runping (韩润萍), GAO Jingxin (高静欣)

  1. School of Arts and Sciences, Beijing Institute of Fashion Technology, Beijing 100029, China
  2. School of Fashion, Beijing Institute of Fashion Technology, Beijing 100029, China
  • Online: 2024-04-01 Published: 2024-04-01

Abstract: Speaker verification systems perform poorly in multi-scenario conditions involving cross-domain, long-duration, and noisy utterances. To address this, this paper proposes PMS-Conformer, a Conformer-based speaker recognition model that is real-time and robust across these scenarios. Its architecture is inspired by the state-of-the-art MFA-Conformer and improves on that model's acoustic feature extractor, network components, and loss calculation module, yielding a novel and effective acoustic feature extractor and a robust speaker embedding extractor with strong generalization capability. PMS-Conformer is trained on the VoxCeleb1&2 datasets and compared against two baselines, MFA-Conformer and ECAPA-TDNN, on speaker verification tasks. The results show that on SITW (long-duration utterances), VoxMovies (cross-domain utterances), and VoxCeleb-O with added noise, the speaker verification system built with PMS-Conformer is more competitive than those built with either baseline. Moreover, the speaker embedding extractor of PMS-Conformer has significantly fewer trainable parameters (Params) and a significantly lower RTF (faster inference) than that of ECAPA-TDNN. These results indicate that PMS-Conformer performs well in real-time multi-scenario settings.

Key words: speaker verification, MFA-Conformer, Sub-center AAM-Softmax, speaker embedding, acoustic feature extraction
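
The abstract reports inference speed as RTF. As a rough illustration only, the sketch below assumes the common definition of real-time factor (processing time divided by the duration of the audio processed, so values below 1 mean faster than real time). The abstract does not describe the paper's measurement protocol, and `dummy_extractor` is a hypothetical stand-in, not the PMS-Conformer embedding extractor.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(extract_embedding, waveform, sample_rate, repeats=5):
    """Average RTF of `extract_embedding` over several runs on one utterance."""
    audio_seconds = len(waveform) / sample_rate
    start = time.perf_counter()
    for _ in range(repeats):
        extract_embedding(waveform)
    # Mean wall-clock time per call.
    elapsed = (time.perf_counter() - start) / repeats
    return real_time_factor(elapsed, audio_seconds)


def dummy_extractor(waveform):
    """Hypothetical placeholder: returns a 1-dimensional 'embedding'."""
    return [sum(waveform) / len(waveform)]


if __name__ == "__main__":
    # Three seconds of silence at an assumed 16 kHz sample rate.
    waveform = [0.0] * (16000 * 3)
    print(measure_rtf(dummy_extractor, waveform, 16000))
```

In practice `extract_embedding` would wrap a trained model's forward pass. The measured value depends on hardware, batch size, and utterance length, so RTF figures are only comparable when measured under the same conditions.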