Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (8): 127-137. DOI: 10.3778/j.issn.1002-8331.2111-0542

• Pattern Recognition and Artificial Intelligence •

Bimodal Emotion Recognition Model Based on Cascaded Two Channel Phased Fusion

XU Zhijing, LIU Xia

  1. College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
  • Online: 2023-04-15 Published: 2023-04-15

Abstract: To fully extract deep emotional features from text and speech, enable effective interactive fusion between the two modalities, and improve emotion recognition accuracy, a bimodal emotion recognition model based on cascade two channel and phased fusion (CTC-PF) is proposed. First, a cascaded sequential attention-Encoder (CSA-Encoder) is designed to process long-distance speech emotion sequence information in parallel and extract deep speech emotion features. Second, an affective field cascade-Encoder (AFC-Encoder) is proposed to improve the model's global and local text understanding and to address the sparsity of key emotional features in text. After the two cascaded channels complete feature extraction for speech and text, a collaborative attention (co-attention) mechanism interactively fuses the important emotional features of the two modalities, which reduces the cost of alignment operations. The Hadamard (element-wise) product is then applied as a second fusion step to capture difference features. This phased fusion realizes information interaction between modal sequences with different time steps and addresses the insufficient interaction of emotional information between the two modalities. In classification experiments on the IEMOCAP dataset, the model reaches an emotion recognition accuracy of 79.4% and an F1-score of 79.0%, a clear improvement over existing mainstream methods, which demonstrates its advantage in emotion recognition from fused speech and text.

Key words: bimodal emotion recognition, cascaded encoder, phased fusion, information interaction
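
The two-stage fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scaled dot-product attention without learned projections for the co-attention stage and mean pooling over time before the Hadamard product; the function names (`cross_attend`, `phased_fusion`) and the toy inputs are hypothetical. The point it shows is that speech and text sequences of different lengths can interact without frame-to-word alignment.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """(n x k) times (k x m) -> (n x m), using nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_attend(query, context):
    """Each query step attends over all context steps.

    Assumption: plain scaled dot-product attention, no learned weights.
    """
    d = len(query[0])
    scores = matmul(query, transpose(context))            # (Tq x Tc)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context)                       # (Tq x d)

def mean_pool(seq):
    """Average a (T x d) sequence over time into one d-dimensional vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def phased_fusion(speech, text):
    """Stage 1: co-attention in both directions. Stage 2: Hadamard product."""
    speech_ctx = cross_attend(speech, text)   # speech steps enriched with text
    text_ctx = cross_attend(text, speech)     # text steps enriched with speech
    # Assumption: pool to fixed-size vectors so unequal lengths can be combined.
    s_vec = mean_pool(speech_ctx)
    t_vec = mean_pool(text_ctx)
    return [s * t for s, t in zip(s_vec, t_vec)]  # element-wise product

# Toy features: 3 speech frames and 2 text tokens, both of dimension 2.
speech = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.9]]
text = [[0.5, 0.5], [0.9, 0.1]]
fused = phased_fusion(speech, text)
print(len(fused))  # one fused vector with the shared feature dimension
```

In a trained model the attention inputs would pass through learned projections and the fused vector would feed a classifier; both are omitted here to keep the two fusion stages visible.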