Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (22): 114-125.DOI: 10.3778/j.issn.1002-8331.2307-0431

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Semantic Alignment and Information Refinement for Multi-Modal Sentiment Analysis

DING Meirong, CHEN Hongye, ZENG Biqing   

  1. School of Software, South China Normal University, Foshan, Guangdong 528225, China
  • Online: 2024-11-15  Published: 2024-11-14


Abstract: Multi-modal sentiment analysis suffers from the heterogeneity gap, the semantic gap, and ineffective fusion across modalities. To address these problems, this paper proposes CM-SAIR, a multi-modal sentiment analysis model based on a cross-modal Transformer for semantic alignment and information refinement. CM-SAIR mitigates multi-modal semantic misalignment and semantic noise, enabling better interactive fusion of multi-modal data. A multi-modal feature embedding module (MFE) enhances the emotional information of the visual and audio modalities; a well-defined inter-modal semantic alignment module (ISA) aligns modality pairs along the temporal dimension; an intra-modal information refinement module (IIR) performs sentiment parsing and sentiment refinement; and a multi-modal gated fusion module (MGF) fuses the modalities effectively. Extensive experiments on popular multi-modal sentiment analysis datasets demonstrate the advantages of CM-SAIR over state-of-the-art baselines.
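The abstract does not give the internals of the ISA and MGF modules, but the two ideas it names, aligning one modality's time steps to another's via cross-modal attention, then admitting the aligned features through a learned gate, can be sketched generically. The sketch below is a minimal NumPy illustration under assumed shapes and randomly initialized weights (`W_g`, `b_g` and all tensor dimensions are hypothetical, not from the paper), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_mod, key_mod):
    """Re-sample key_mod onto query_mod's timeline via scaled dot-product
    attention: the core operation of a cross-modal Transformer layer."""
    d_k = query_mod.shape[-1]
    scores = query_mod @ key_mod.T / np.sqrt(d_k)   # (Tq, Tk) alignment scores
    weights = softmax(scores, axis=-1)              # each query step attends over key steps
    return weights @ key_mod                        # key modality aligned to query's Tq steps

def gated_fusion(primary, aligned, W_g, b_g):
    """A sigmoid gate, computed from both inputs, controls how much of the
    aligned modality is admitted into the fused representation."""
    gate_in = np.concatenate([primary, aligned], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ W_g + b_g)))
    return gate * primary + (1.0 - gate) * aligned

# toy dimensions: text has 6 time steps, audio has 10, shared feature dim 8
rng = np.random.default_rng(0)
text = rng.standard_normal((6, 8))
audio = rng.standard_normal((10, 8))

aligned_audio = cross_modal_attention(text, audio)  # audio on text's 6 steps
W_g = rng.standard_normal((16, 8)) * 0.1
b_g = np.zeros(8)
fused = gated_fusion(text, aligned_audio, W_g, b_g)
print(aligned_audio.shape, fused.shape)             # (6, 8) (6, 8)
```

Alignment before fusion is what lets the gate operate step-by-step: after `cross_modal_attention`, both inputs share the text modality's temporal resolution, so the element-wise gate is well defined.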

Key words: multi-modal feature embedding, semantic alignment, information refinement, multi-modal gated fusion, multi-modal sentiment analysis
