[1] HANSEN J H L, HASAN T. Speaker recognition by machines and humans: a tutorial review[J]. IEEE Signal Processing Magazine, 2015, 32(6): 74-99.
[2] VARIANI E, LEI X, MCDERMOTT E, et al. Deep neural networks for small footprint text-dependent speaker verification[C]//2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014: 4052-4056.
[3] SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: robust DNN embeddings for speaker recognition[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 5329-5333.
[4] CHUNG J S, HUH J, MUN S, et al. In defence of metric learning for speaker recognition[J]. arXiv:2003.11982, 2020.
[5] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7132-7141.
[6] GAO S H, CHENG M M, ZHAO K, et al. Res2Net: a new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(2): 652-662.
[7] DESPLANQUES B, THIENPONDT J, DEMUYNCK K. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification[J]. arXiv:2005.07143, 2020.
[8] THIENPONDT J, DESPLANQUES B, DEMUYNCK K. Integrating frequency translational invariance in TDNNs and frequency positional information in 2D resnets to enhance speaker verification[J]. arXiv:2104.02370, 2021.
[9] LIU T, DAS R K, LEE K A, et al. MFA: TDNN with multi-scale frequency-channel attention for text-independent speaker verification with short utterances[C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 7517-7521.
[10] ZHAO M, MA Y, LIU M, et al. The SpeakIn system for VoxCeleb speaker recognition challenge 2021[J]. arXiv:2109.01989, 2021.
[11] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[12] CHEN Z G, LI P, XIAO R Q, et al. A multiscale feature extraction method for text-independent speaker recognition[J]. Journal of Electronics & Information Technology, 2021, 43(11): 3266-3271.
[13] DENG L H, DENG F, ZHANG G X, et al. Multi-scale end-to-end speaker recognition system based on improved Res2Net[J]. Computer Engineering and Applications, 2023, 59(24): 110-120.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017.
[15] ZHOU H, ZHANG S, PENG J, et al. Informer: beyond efficient transformer for long sequence time-series forecasting[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 11106-11115.
[16] LAVRENTYEVA G, NOVOSELOV S, VOLOKHOV V, et al. STC speaker recognition system for the NIST SRE 2021[C]//Proceedings of the Speaker and Language Recognition Workshop (Odyssey 2022), 2022: 354-361.
[17] LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 10012-10022.
[18] BAEVSKI A, ZHOU Y, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[C]//Advances in Neural Information Processing Systems, 2020: 12449-12460.
[19] WANG R, AO J, ZHOU L, et al. Multi-view self-attention based transformer for speaker recognition[C]//2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6732-6736.
[20] CHEN S, WANG C, CHEN Z, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1-14.
[21] GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[J]. arXiv:2005.08100, 2020.
[22] KOIZUMI Y, KARITA S, WISDOM S, et al. DF-Conformer: integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement[C]//2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021: 161-165.
[23] CHEN S, WU Y, CHEN Z, et al. Continuous speech separation with conformer[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021: 5749-5753.
[24] ZHANG Y, LV Z, WU H, et al. MFA-Conformer: multi-scale feature aggregation conformer for automatic speaker verification[J]. arXiv:2203.15249, 2022.
[25] LIAO D, JIANG T, WANG F, et al. Towards a unified conformer structure: from ASR to ASV Task[J]. arXiv:2211.07201, 2022.
[26] DENG J, GUO J, LIU T, et al. Sub-center ArcFace: boosting face recognition by large-scale noisy web faces[C]//European Conference on Computer Vision. Cham: Springer, 2020: 741-757.
[27] ZHANG L, ZHAO H, MENG Q, et al. Beijing ZKJ-NPU speaker verification system for VoxCeleb Speaker Recognition Challenge 2021[J]. arXiv:2109.03568, 2021.
[28] KIM C, STERN R M. Power-normalized cepstral coefficients (PNCC) for robust speech recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(7): 1315-1329.
[29] ZEINELDEEN M, XU J, LÜSCHER C, et al. Improving the training recipe for a robust conformer-based hybrid model[J]. arXiv:2206.12955, 2022.
[30] DAI Z, YANG Z, YANG Y, et al. Transformer-XL: attentive language models beyond a fixed-length context[J]. arXiv:1901.02860, 2019.
[31] GAO Z, SONG Y, MCLOUGHLIN I, et al. Improving aggregation and loss function for better embedding learning in end-to-end speaker verification system[C]//Proceedings of INTERSPEECH, 2019: 361-365.
[32] TANG Y, DING G, HUANG J, et al. Deep speaker embedding learning with multi-level pooling for text-independent speaker verification[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: 6116-6120.
[33] NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: a large-scale speaker identification dataset[J]. arXiv:1706.08612, 2017.
[34] CHUNG J S, NAGRANI A, ZISSERMAN A. VoxCeleb2: deep speaker recognition[J]. arXiv:1806.05622, 2018.
[35] BROWN A, HUH J, NAGRANI A, et al. Playing a part: speaker verification at the movies[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021: 6174-6178.
[36] MCLAREN M, FERRER L, CASTAN D, et al. The speakers in the wild (SITW) speaker recognition database[C]//Proceedings of INTERSPEECH, 2016: 818-822.
[37] FALCON W. PyTorch Lightning[EB/OL]. [2022-12-10]. https://github.com/PyTorchLightning/pytorch-lightning.
[38] ZHANG B B, WU D, YANG C, et al. WeNet: production first and production ready end-to-end speech recognition toolkit[J]. arXiv:2102.01547, 2021.
[39] PARK D S, ZHANG Y, CHIU C C, et al. SpecAugment on large scale datasets[C]//2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[40] NIST speaker recognition evaluation 2016[EB/OL]. [2022-12-10]. https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016/.
[41] HIGUCHI Y, INAGUMA H, WATANABE S, et al. Improved Mask-CTC for non-autoregressive end-to-end ASR[C]//2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021: 8363-8367.