Multi-Modal Emotion Recognition Based on Transformer-ESIM Attention Mechanism

doi:10.3778/j.issn.1002-8331.2010-0463

Abstract: To improve the accuracy of multi-modal emotion recognition based on the fusion of speech and text, an emotion recognition method based on the Transformer-ESIM (Transformer-enhanced sequential inference model) attention mechanism is proposed. Traditional recurrent neural networks suffer from the long-term dependency problem when extracting features from speech and text sequences, and their sequential nature prevents them from capturing long-distance features. The multi-head attention mechanism of the Transformer encoder layer is therefore used to process the sequences in parallel. This removes the sequence distance limitation, fully extracts the emotional semantic information within each sequence, yields deep emotional semantic encodings of the speech and text sequences, and improves processing speed. The ESIM interactive attention mechanism then computes similarity features between speech and text to align the two modalities. This addresses the inter-modal interaction that is ignored when multi-modal features are fused directly, and improves the model's understanding of emotional semantics and its generalization ability. The method is evaluated on the IEMOCAP dataset. The results show that the emotion classification accuracy can reach 72.6%, and that every metric improves markedly compared with other mainstream multi-modal emotion recognition methods.

Keywords: multi-modal emotion recognition, Transformer encoder layer, multi-head attention mechanism, interactive attention
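
The interactive attention step described in the abstract can be illustrated with a minimal NumPy sketch of ESIM-style soft alignment between two encoded sequences. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the sketch follows the commonly used ESIM local inference step (dot-product similarity, softmax alignment in both directions, then enhancement by difference and element-wise product). The function names `esim_interaction` and `pool_and_fuse`, the sequence lengths, and the feature dimension are all hypothetical, and random arrays stand in for the Transformer encoder outputs.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def esim_interaction(speech, text):
    """ESIM-style soft alignment between two encoded sequences.

    speech: (n_speech, d) array; text: (n_text, d) array.
    Both are placeholders for encoder outputs (an assumption here).
    """
    # Similarity between every speech frame and every text token.
    scores = speech @ text.T                              # (n_speech, n_text)

    # Each speech frame attends over text tokens, and vice versa.
    speech_aligned = softmax(scores, axis=1) @ text       # (n_speech, d)
    text_aligned = softmax(scores, axis=0).T @ speech     # (n_text, d)

    # Enhancement: original, aligned, difference, element-wise product.
    speech_enh = np.concatenate(
        [speech, speech_aligned, speech - speech_aligned, speech * speech_aligned],
        axis=1,
    )
    text_enh = np.concatenate(
        [text, text_aligned, text - text_aligned, text * text_aligned],
        axis=1,
    )
    return speech_enh, text_enh


def pool_and_fuse(speech_enh, text_enh):
    """Average- and max-pool each sequence over time, then concatenate."""
    parts = [
        speech_enh.mean(axis=0),
        speech_enh.max(axis=0),
        text_enh.mean(axis=0),
        text_enh.max(axis=0),
    ]
    return np.concatenate(parts)


# Hypothetical shapes: 6 speech frames, 4 text tokens, 8-dimensional encodings.
rng = np.random.default_rng(0)
speech = rng.standard_normal((6, 8))
text = rng.standard_normal((4, 8))

speech_enh, text_enh = esim_interaction(speech, text)
fused = pool_and_fuse(speech_enh, text_enh)
print(speech_enh.shape, text_enh.shape, fused.shape)
```

With these shapes, each enhanced sequence has 32 features per step and the fused vector has 128 entries. In a full model, such a fused vector would typically be passed to a classifier over the emotion categories; that stage is omitted here.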

XU Zhijing, GAO Shan. Multi-Modal Emotion Recognition Based on Transformer-ESIM Attention Mechanism[J]. Computer Engineering and Applications, 2022, 58(10): 132-138.

徐志京, 高姗. 基于Transformer-ESIM注意力机制的多模态情绪识别[J]. 计算机工程与应用, 2022, 58(10): 132-138.
