Computer Engineering and Applications, 2024, Vol. 60, Issue (16): 168-176. DOI: 10.3778/j.issn.1002-8331.2305-0049

• Pattern Recognition and Artificial Intelligence •

C-BGA: Multimodal Speech Emotion Recognition Network Combining Contrastive Learning

MIAO Borui, XU Yunfeng, ZHAO Shaojie, WANG Jialin   

  1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050000, China
  • Online: 2024-08-15  Published: 2024-08-15

Abstract: Current multimodal speech emotion recognition (SER) datasets are small in scale yet dense in information, so models underfit the individual modalities and fail to mine the information latent in the data. To address this problem, a multimodal speech emotion classification network based on contrastive learning is proposed. On the one hand, skip connections (SC) are introduced into the network, effectively alleviating network degradation; on the other hand, a new loss function grounded in contrastive learning (CL) theory is proposed to accelerate model convergence. Evaluated on the IEMOCAP dataset, the model achieves an unweighted accuracy (UA) of 82.68% and a weighted accuracy (WA) of 82.35%, and these results demonstrate its superiority.
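The abstract names the two mechanisms but not their exact formulations, so the following PyTorch sketch only illustrates how skip connections and a joint contrastive objective are commonly combined; it is not the authors' implementation. ResidualBlock, supervised_contrastive_loss, total_loss, and the mixing weight alpha are hypothetical names, and the contrastive term uses the standard supervised contrastive loss of Khosla et al. (2020), since the paper's own loss is not specified here.

    # Hypothetical sketch; all names below are illustrative, not from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        # A feed-forward block with a skip connection (SC): output = norm(x + f(x)),
        # so stacking more blocks does not degrade the representation.
        def __init__(self, dim: int):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            self.norm = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.norm(x + self.ff(x))  # skip connection adds the input back

    def supervised_contrastive_loss(features, labels, temperature: float = 0.1):
        # Supervised contrastive loss (Khosla et al., 2020): same-emotion samples
        # are pulled together in the embedding space, different emotions pushed apart.
        z = F.normalize(features, dim=1)
        sim = z @ z.t() / temperature                        # pairwise similarities
        n = z.size(0)
        eye = torch.eye(n, dtype=torch.bool, device=z.device)
        pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye  # positive-pair mask
        logits = sim.masked_fill(eye, float('-inf'))         # exclude self-similarity
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        per_sample = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
        return per_sample[pos.sum(1) > 0].mean()             # anchors with >=1 positive

    def total_loss(logits, embeddings, labels, alpha: float = 0.5):
        # Joint objective: cross-entropy for emotion classification plus the
        # contrastive term, weighted by a hypothetical coefficient alpha.
        return F.cross_entropy(logits, labels) + alpha * supervised_contrastive_loss(embeddings, labels)

Under this assumed formulation, the contrastive term shapes the shared embedding space across modalities while cross-entropy drives the final classification, which is one common way such a joint loss can speed up convergence on small datasets.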

Key words: multimodal, speech emotion recognition, contrastive learning, attention mechanism