Cross-Modal Transformer Combination Model for Sentiment Analysis

doi:10.3778/j.issn.1002-8331.2302-0238

Abstract

Transformer-based end-to-end combination deep learning models are the mainstream approach to multimodal sentiment analysis. Such models have three weaknesses: they extract sentiment features poorly from low-resource modalities; differences in feature scale across non-aligned modalities cause key feature information to be lost during alignment and fusion; and the traditional attention model processes multimodal data in parallel, which makes the multimodal long-term dependency mechanism unreliable. To address these problems, this paper proposes LAACMT, a multimodal sentiment analysis model based on a lightweight attention aggregation module and a cross-modal Transformer, which can use non-aligned multimodal data for both binary and multiclass classification. The model extracts low-resource modal information with a gated recurrent unit (GRU) and an improved feature extraction algorithm, aligns multimodal contexts with positional encoding combined with convolution scaling, and fuses the aligned multimodal data with a cross-modal multi-head attention mechanism that establishes a reliable cross-modal long-term dependency mechanism. Experiments on CMU-MOSI, a non-aligned dataset with three modalities (text, speech, and video), show that the model consistently improves on the state of the art (SOTA): Acc7 improves by 3.96%, Acc2 by 4.08%, and F1 score by 3.35%. Ablation results indicate that the proposed model addresses the problems listed above, reduces the complexity of Transformer-based multimodal sentiment analysis models, improves performance, and avoids over-fitting.

Keywords: multimodal sentiment analysis, lightweight attention aggregation module, cross-modal Transformer, gated recurrent unit, cross-modal multi-head attention mechanism
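
The abstract names cross-modal attention over non-aligned sequences as the fusion step. As an illustration only, the sketch below shows the generic single-head form of that idea in plain Python: queries come from one modality and keys and values from another, so the two sequences may have different lengths. It is not the LAACMT module itself. The function names (`softmax`, `matmul`, `cross_modal_attention`), the toy inputs, and the identity projection matrices are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def cross_modal_attention(target, source, w_q, w_k, w_v):
    """Let the target modality attend over the source modality.

    target: (T_target x d) sequence, e.g. text features.
    source: (T_source x d) sequence, e.g. speech features.
    T_target and T_source may differ, so no prior alignment is needed.
    Returns the fused (T_target x d_v) sequence and the attention weights.
    """
    q = matmul(target, w_q)               # queries from the target modality
    k = matmul(source, w_k)               # keys from the source modality
    v = matmul(source, w_v)               # values from the source modality
    scale = math.sqrt(len(q[0]))          # sqrt(d_k) scaling
    k_t = [list(col) for col in zip(*k)]  # transpose of the keys
    scores = matmul(q, k_t)               # (T_target x T_source)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Hypothetical toy inputs: 2 text steps attend over 3 speech steps.
identity = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for learned projections
text = [[1.0, 0.0], [0.0, 1.0]]
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_modal_attention(text, speech, identity, identity, identity)
```

The output keeps the length of the target sequence (two steps here), and each row of `weights` is a distribution over the three source steps. A multi-head version would run several such projections in parallel and concatenate the results.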

WANG Liang, WANG Yi, WANG Jun. Cross-Modal Transformer Combination Model for Sentiment Analysis[J]. Computer Engineering and Applications, 2024, 60(13): 124-135.

王亮, 王屹, 王军. 情感分析的跨模态Transformer组合模型[J]. 计算机工程与应用, 2024, 60(13): 124-135.
