Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (24): 110-120. DOI: 10.3778/j.issn.1002-8331.2208-0085

• Pattern Recognition and Artificial Intelligence •

Multi-Scale End-to-End Speaker Recognition System Based on Improved Res2Net

DENG Lihong, DENG Fei, ZHANG Gexiang, YANG Qiang   

  1.School of Computer and Network Security (Oxford Brookes Institute), Chengdu University of Technology, Chengdu 610059, China
    2.Artificial Intelligence Research Center, Chengdu University of Technology, Chengdu 610059, China
    3.School of Control Engineering, Chengdu University of Information Technology, Chengdu 610059, China
  • Online: 2023-12-15    Published: 2023-12-15

Abstract: Lightweight convolutional neural networks in speaker recognition systems suffer from weak feature extraction ability and poor recognition performance. To improve feature extraction, many methods adopt deeper, wider, and more complex network structures, which multiply the number of parameters and the inference time. This paper introduces Res2Net, a lightweight network from object detection, into speaker recognition and verifies its effectiveness and robustness on this task. An improved variant, FullRes2Net, is then proposed; it combines more and larger receptive fields and achieves a 17% performance improvement over Res2Net with almost no increase in the number of parameters. Meanwhile, to address the problems of existing attention methods, compensate for the shortcomings of convolution itself, and further enhance the feature extraction ability of convolutional neural networks, a mixed time-frequency channel attention is proposed. It interacts across the time, frequency, and channel dimensions of the audio features to capture dependencies among them, effectively strengthening the network's feature extraction ability. Experiments on the VoxCeleb dataset show that the proposed method effectively improves the feature extraction and generalization ability of the system, achieving a 34% performance improvement over Res2Net and outperforming advanced speaker recognition systems that use complex structures. The result is an end-to-end structure with fewer parameters and higher efficiency, well suited to real-world applications.

Key words: speaker recognition, end-to-end, attention mechanisms
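
The abstract gives no implementation details, so the following is a minimal PyTorch sketch written only to illustrate the two ideas it describes: a Res2Net-style block whose channel groups are convolved hierarchically so that receptive fields of several sizes are stacked, and a simple attention module that gates a spectrogram feature map along its time, frequency, and channel dimensions. All module names, kernel sizes, and hyperparameters below are assumptions for illustration, not the authors' FullRes2Net or mixed time-frequency channel attention implementation.

# Illustrative sketch only; not the paper's released code.
import torch
import torch.nn as nn


class Res2NetStyleBlock(nn.Module):
    """Split channels into `scale` groups; each group after the first is convolved
    and added to the next group's input, stacking receptive fields of growing size."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # one 3x3 conv per group except the first (identity branch)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1) for _ in range(scale - 1)
        )
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, self.scale, dim=1)   # split along the channel axis
        out = [splits[0]]                            # first group passes through unchanged
        prev = None
        for i, conv in enumerate(self.convs):
            s = splits[i + 1]
            prev = conv(s if prev is None else s + prev)  # hierarchical connection between groups
            out.append(prev)
        y = torch.cat(out, dim=1)
        return self.relu(self.bn(y) + x)             # residual connection over the whole block


class TimeFreqChannelAttention(nn.Module):
    """Toy mixed attention: squeeze the feature map along the channel, frequency, and
    time axes, then gate the input with the three resulting masks."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid(),
        )
        self.freq_conv = nn.Conv2d(1, 1, kernel_size=(7, 1), padding=(3, 0))
        self.time_conv = nn.Conv2d(1, 1, kernel_size=(1, 7), padding=(0, 3))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        ch = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)            # channel gate
        pooled = x.mean(dim=1, keepdim=True)                                 # (b, 1, f, t)
        fr = self.sigmoid(self.freq_conv(pooled.mean(dim=3, keepdim=True)))  # (b, 1, f, 1)
        tm = self.sigmoid(self.time_conv(pooled.mean(dim=2, keepdim=True)))  # (b, 1, 1, t)
        return x * ch * fr * tm                                              # broadcast-gated features


if __name__ == "__main__":
    feats = torch.randn(2, 64, 80, 200)            # (batch, channels, mel bins, frames)
    block = Res2NetStyleBlock(channels=64, scale=4)
    attn = TimeFreqChannelAttention(channels=64)
    print(attn(block(feats)).shape)                # torch.Size([2, 64, 80, 200])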