Speaker Verification Based on Teacher-Free Knowledge Distillation Model

doi:10.3778/j.issn.1002-8331.2012-0298

Abstract

Text-independent speaker verification models achieve strong performance through complex network structures and varied feature-extraction methods. However, they consume large amounts of memory and incur growing computational cost, which makes them difficult to deploy on resource-limited hardware. To address this problem, this work builds a teacher-free speaker verification (Tf-SV) model on a lightweight residual network, using the teacher-free knowledge distillation (Tf-KD) model, whose virtual teacher provides 100% classification accuracy and a smoothed output probability distribution. In addition, a spatially shared, channel-wise dynamic rectified linear unit activation function and the additive angular margin loss (AAM-Softmax) are introduced. These greatly improve the proposed model's feature representation, training efficiency, and post-compression performance, so that the Tf-SV model can be deployed on devices with limited storage or computing resources. Experiments on the VoxCeleb1 dataset show that the equal error rate (EER) of the Tf-SV model is reduced to 3.4%, a clear improvement over previously published results, which demonstrates the effectiveness of the compressed model on the speaker verification task.
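The Tf-KD idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`virtual_teacher`, `tf_kd_loss`) and the values of `correct_prob`, `alpha`, and `temperature` are assumptions chosen for illustration, and the teacher distribution is not temperature-softened here, which is a simplification. The sketch builds a hand-designed teacher that always ranks the true speaker first (hence 100% accuracy) with a smoothed distribution over the remaining classes, and mixes the usual cross-entropy with a KL term toward that teacher.

```python
import numpy as np

def virtual_teacher(labels, num_classes, correct_prob=0.9):
    """Hand-designed teacher: correct_prob on the true class, the rest spread uniformly.

    correct_prob must be < 1 so that every entry stays positive (needed for the log below).
    """
    off_value = (1.0 - correct_prob) / (num_classes - 1)
    probs = np.full((len(labels), num_classes), off_value)
    probs[np.arange(len(labels)), labels] = correct_prob
    return probs

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def tf_kd_loss(student_logits, labels, alpha=0.5, temperature=1.0, correct_prob=0.9):
    """(1 - alpha) * cross-entropy + alpha * KL(virtual teacher || student)."""
    n, num_classes = student_logits.shape
    teacher = virtual_teacher(labels, num_classes, correct_prob)

    # Standard cross-entropy against the hard labels.
    cross_entropy = -log_softmax(student_logits)[np.arange(n), labels].mean()

    # KL divergence from the virtual teacher to the temperature-scaled student output.
    student_log_probs = log_softmax(student_logits / temperature)
    kl = (teacher * (np.log(teacher) - student_log_probs)).sum(axis=1).mean()

    return (1.0 - alpha) * cross_entropy + alpha * kl
```

Because no teacher network is trained or stored, the only extra cost over plain cross-entropy training is the KL term, which is what makes the approach attractive for compact models.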

Keywords: teacher-free knowledge distillation, dynamic rectified linear unit function, additive angular margin loss function, model compression, speaker verification

XIAO Jinzhuang, LI Ruipeng, JI Mengmeng. Speaker Verification Based on Teacher-Free Knowledge Distillation Model[J]. Computer Engineering and Applications, 2022, 58(8): 198-203.

肖金壮, 李瑞鹏, 纪盟盟. 基于虚拟教师蒸馏模型的说话人确认方法[J]. 计算机工程与应用, 2022, 58(8): 198-203.
