Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (7): 147-156. DOI: 10.3778/j.issn.1002-8331.2210-0145

• Pattern Recognition and Artificial Intelligence •

Conformer-Based Speaker Recognition Model for Real-Time Multi-Scenarios

XUAN Xi, HAN Runping, GAO Jingxin   

  1. School of Arts and Sciences, Beijing Institute of Fashion Technology, Beijing 100029, China
  2. School of Fashion, Beijing Institute of Fashion Technology, Beijing 100029, China
  • Online: 2024-04-01  Published: 2024-04-01

Abstract: To address the poor performance of speaker verification systems in scenarios involving cross-domain, long-duration, and noisy utterances, this paper proposes PMS-Conformer, a real-time robust speaker recognition model built on the Conformer architecture. The design of PMS-Conformer is inspired by the state-of-the-art MFA-Conformer model: it improves MFA-Conformer's acoustic feature extractor, network components, and loss calculation module, yielding a novel and effective acoustic feature extractor and a robust speaker embedding extractor with strong generalization capability. PMS-Conformer is trained on the VoxCeleb1&2 datasets and compared with the MFA-Conformer and ECAPA-TDNN baselines in extensive speaker verification experiments. The experimental results show that on VoxMovies (cross-domain utterances), SITW (long-duration utterances), and VoxCeleb-O with noise added to its utterances, the ASV system built with PMS-Conformer is more competitive than those built with MFA-Conformer and ECAPA-TDNN. Moreover, the trainable parameters (Params) and real-time factor (RTF) of PMS-Conformer's speaker embedding extractor are significantly lower than those of ECAPA-TDNN. All evaluation results demonstrate that PMS-Conformer performs well in real-time multi-scenario conditions.
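The abstract states that PMS-Conformer revises MFA-Conformer's loss calculation module, and the keywords name Sub-center AAM-Softmax. As a rough illustration of that loss family only (not the paper's implementation), the following is a minimal NumPy sketch for a single utterance; the function name, tensor shapes, and the scale/margin defaults s=30, m=0.2 are assumptions for illustration.

```python
import numpy as np

def subcenter_aam_softmax_loss(embedding, weights, label, s=30.0, m=0.2):
    """Sub-center AAM-Softmax loss for one utterance (illustrative sketch).

    embedding: (dim,) speaker embedding
    weights:   (num_classes, K, dim) K sub-centers per speaker class
    label:     int, ground-truth speaker index
    s, m:      logit scale and additive angular margin (assumed defaults)
    """
    # L2-normalise the embedding and every sub-center
    x = embedding / np.linalg.norm(embedding)
    w = weights / np.linalg.norm(weights, axis=-1, keepdims=True)

    # Cosine similarity to each sub-center, then max-pool over the K
    # sub-centers so hard or noisy samples can attach to a secondary center
    cos = w @ x            # (num_classes, K)
    cos = cos.max(axis=-1)  # (num_classes,)

    # Additive angular margin on the target class: cos(theta + m)
    theta = np.arccos(np.clip(cos[label], -1.0, 1.0))
    logits = s * cos
    logits[label] = s * np.cos(theta + m)

    # Softmax cross-entropy on the scaled, margin-adjusted logits
    logits -= logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

Compared with plain AAM-Softmax, the max over K sub-centers is what lends robustness to intra-class variation such as cross-domain or noisy recordings; a larger margin m makes the target class harder to satisfy and thus increases the loss for the same embedding.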

Key words: speaker verification, MFA-Conformer, Sub-center AAM-Softmax, speaker embedding, acoustic feature extraction
