Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (10): 132-138. DOI: 10.3778/j.issn.1002-8331.2010-0463

• Pattern Recognition and Artificial Intelligence •

Multi-Modal Emotion Recognition Based on Transformer-ESIM Attention Mechanism

XU Zhijing, GAO Shan

  1. College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
  • Online: 2022-05-15    Published: 2022-05-15

Abstract: To improve the accuracy of emotion recognition based on the fusion of speech and text, a multi-modal emotion recognition method based on the Transformer-ESIM (Transformer-enhanced sequential inference model) attention mechanism is proposed. Traditional recurrent neural networks suffer from long-term dependency problems when extracting features from speech and text sequences, and their inherently sequential computation cannot capture long-distance features. The multi-head attention mechanism of the Transformer encoding layer is therefore used to process the sequences in parallel, which removes the sequence-distance limitation, fully extracts the emotional semantic information within each sequence, yields deep emotional semantic encodings of the speech and text sequences, and improves processing speed. The similarity features between speech and text are then computed by the ESIM interactive attention mechanism to align the speech and text modalities, addressing the inter-modal interaction that direct fusion of multi-modal features ignores and improving the model's comprehension and generalization of emotional semantics. The method is evaluated on the IEMOCAP dataset; experimental results show that the emotion classification accuracy reaches 72.6%, a clear improvement over other mainstream multi-modal emotion recognition methods on every metric.
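The abstract does not include implementation details. As a rough illustration of the Transformer-encoding stage it describes, the following PyTorch sketch encodes a speech-feature sequence and a text-embedding sequence in parallel with multi-head self-attention; all module names, dimensions, and hyperparameters here are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Hypothetical Transformer encoder for one modality (speech or text).

    Multi-head self-attention lets every frame/token attend to every other
    position in a single step, avoiding the long-distance limitation of RNNs.
    """
    def __init__(self, input_dim, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)   # map raw features to model width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x, pad_mask=None):
        # x: (batch, seq_len, input_dim); pad_mask: (batch, seq_len), True marks padding
        h = self.proj(x)
        return self.encoder(h, src_key_padding_mask=pad_mask)

# Example: encode a batch of speech frames and word embeddings in parallel.
speech = torch.randn(4, 300, 40)            # e.g. 40-dim log-Mel frames (assumed)
text = torch.randn(4, 50, 300)              # e.g. 300-dim word embeddings (assumed)
speech_enc = ModalityEncoder(40)(speech)    # (4, 300, 256)
text_enc = ModalityEncoder(300)(text)       # (4, 50, 256)
```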

Key words: multi-modal emotion recognition, Transformer encoding layer, multi-head attention mechanism, interactive attention
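As a similarly hedged sketch of the ESIM-style interactive attention described in the abstract (again an illustrative assumption, not the paper's code), the soft alignment between the encoded speech and text sequences can be computed from a dot-product similarity matrix, followed by the ESIM enhancement of concatenating each sequence with its aligned counterpart, their difference, and their element-wise product.

```python
import torch
import torch.nn.functional as F

def esim_interaction(speech_enc, text_enc):
    """Soft-align two encoded sequences with interactive (cross) attention,
    then build ESIM-style enhanced features [a; a~; a - a~; a * a~]."""
    # Similarity between every speech frame and every text token:
    # (batch, len_s, d) x (batch, d, len_t) -> (batch, len_s, len_t)
    sim = torch.bmm(speech_enc, text_enc.transpose(1, 2))

    # Speech attends to text, and text attends to speech.
    speech_aligned = torch.bmm(F.softmax(sim, dim=-1), text_enc)                   # (b, len_s, d)
    text_aligned = torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), speech_enc)   # (b, len_t, d)

    # ESIM enhancement: original, aligned, difference, element-wise product.
    speech_feat = torch.cat([speech_enc, speech_aligned,
                             speech_enc - speech_aligned,
                             speech_enc * speech_aligned], dim=-1)
    text_feat = torch.cat([text_enc, text_aligned,
                           text_enc - text_aligned,
                           text_enc * text_aligned], dim=-1)
    return speech_feat, text_feat

# Pooling the two enhanced sequences over time and feeding the concatenation
# to a small classifier would produce the emotion prediction; that head is
# omitted here, as the paper's exact fusion and classifier are not specified
# in the abstract.
```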