Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (10): 164-172. DOI: 10.3778/j.issn.1002-8331.2301-0080

• Pattern Recognition and Artificial Intelligence •

Multi-Model Fusion VoxSRC22 Speaker Diarization System

DU Yuxuan, ZHOU Ruohua   

  1. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
  • Online: 2024-05-15 Published: 2024-05-15

Abstract: To address the "who spoke when" problem effectively, a speaker diarization method is proposed. The method consists of six modules: voice activity detection (VAD), speech enhancement, speaker embedding extraction, speaker clustering, overlapping speech detection (OSD), and result fusion. Applying speech enhancement improves the performance of voice activity detection. Effectively combining different speaker embedding extractors and clustering algorithms further reduces the diarization error rate. The best performance is obtained by handling overlapping speech after system fusion. Experimental results show that the best system achieves a 72% improvement relative to the baseline, with a diarization error rate (DER) of 5.48% and a Jaccard error rate (JER) of 32.10% on the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2022 evaluation set, ranking fourth.

Key words: speaker diarization, voice activity detection, speaker embedding, speaker clustering, result fusion
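
The DER reported in the abstract is conventionally defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The sketch below illustrates that definition on frame-level labels. It is a simplified illustration and not the scoring procedure used in the paper or by the challenge: the function name `diarization_error_rate`, the set-per-frame input format, and the brute-force speaker mapping are all assumptions made for this example, and no forgiveness collar or time-based segment scoring is modelled.

```python
from itertools import permutations


def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch (hypothetical helper, not the official scorer).

    Both arguments are lists with one set of speaker labels per frame;
    an empty set means non-speech, two labels mean overlapping speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    total_ref = sum(len(ref) for ref in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    # Missed speech and false alarms depend only on per-frame speaker counts.
    missed = sum(max(0, len(r) - len(h)) for r, h in zip(reference, hypothesis))
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in zip(reference, hypothesis))

    ref_labels = sorted(set().union(*reference))
    hyp_labels = sorted(set().union(*hypothesis))

    # Pad with None so every hypothesis label can be mapped to something,
    # even when the hypothesis contains more speakers than the reference.
    padded = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))

    # Brute-force search for the one-to-one label mapping that maximizes the
    # number of correctly attributed speaker-frames (fine for a few speakers).
    best_correct = 0
    for perm in permutations(padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(
            len(r & {mapping[label] for label in h})
            for r, h in zip(reference, hypothesis)
        )
        best_correct = max(best_correct, correct)

    # Frames where both sides have speech but the mapped labels disagree.
    matched_slots = sum(min(len(r), len(h)) for r, h in zip(reference, hypothesis))
    confusion = matched_slots - best_correct

    return (missed + false_alarm + confusion) / total_ref


if __name__ == "__main__":
    ref = [{"A"}, {"A"}, {"B"}, {"B"}, set(), {"A", "B"}]
    hyp = [{"s1"}, {"s1"}, {"s1"}, {"s2"}, {"s2"}, {"s2"}]
    print(f"DER = {diarization_error_rate(ref, hyp):.2%}")
```

In this toy input the last frame contains overlapping speech that the hypothesis covers with only one speaker, which counts as missed speech. This is the kind of error that an overlapping speech detection stage is meant to reduce.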
