[1] 王旭阳, 庞文倩, 赵丽婕. 多模态方面级情感分析的多视图交互学习网络[J]. 计算机工程与应用, 2024, 60(7): 92-100.
WANG X Y, PANG W Q, ZHAO L J. Multiview interaction learning network for multimodal aspect-level sentiment analysis[J]. Computer Engineering and Applications, 2024, 60(7): 92-100.
[2] 王亮, 王屹, 王军. 情感分析的跨模态Transformer组合模型[J]. 计算机工程与应用, 2024, 60(13): 124-135.
WANG L, WANG Y, WANG J. Cross-modal transformer combination model for sentiment analysis[J]. Computer Engineering and Applications, 2024, 60(13): 124-135.
[3] YANG X C, FENG S, ZHANG Y F, et al. Multimodal sentiment detection based on multi-channel graph neural networks[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2021: 328-339.
[4] KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[J]. arXiv:1609.02907, 2016.
[5] BASTINGS J, TITOV I, AZIZ W, et al. Graph convolutional encoders for syntax-aware neural machine translation[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 1957-1967.
[6] YAO L, MAO C S, LUO Y. Graph convolutional networks for text classification[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7370-7377.
[7] WU J F, MAI S J, HU H F. Graph capsule aggregation for unaligned multimodal sequences[C]//Proceedings of the 2021 International Conference on Multimodal Interaction. New York: ACM, 2021: 521-529.
[8] LIAO W X, ZENG B, LIU J Q, et al. Image-text interaction graph neural network for image-text sentiment analysis[J]. Applied Intelligence, 2022, 52(10): 11184-11198.
[9] ZHAO T, PENG J J, HUANG Y S, et al. A graph convolution-based heterogeneous fusion network for multimodal sentiment analysis[J]. Applied Intelligence, 2023, 53(24): 30455-30468.
[10] WANG H B, REN C, YU Z T. Multimodal sentiment analysis based on cross-instance graph neural networks[J]. Applied Intelligence, 2024, 54(4): 3403-3416.
[11] MAI S J, XING S L, HE J X, et al. Multimodal graph for unaligned multimodal sequence analysis via graph convolution and graph pooling[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(2): 1-24.
[12] HUANG F R, WEI K M, WENG J, et al. Attention-based modality-gated networks for image-text sentiment analysis[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16(3): 1-19.
[13] AN J Y, ZAINON W M N W. Integrating color cues to improve multimodal sentiment analysis in social media[J]. Engineering Applications of Artificial Intelligence, 2023, 126: 106874.
[14] LIU H, LI K, FAN J P, et al. Social image-text sentiment classification with cross-modal consistency and knowledge distillation[J]. IEEE Transactions on Affective Computing, 2023, 14(4): 3332-3344.
[15] WANG H R, LI X H, REN Z Y, et al. Multimodal sentiment analysis representations learning via contrastive learning with condense attention fusion[J]. Sensors, 2023, 23(5): 2679.
[16] WANG D, TIAN C N, LIANG X, et al. Dual-perspective fusion network for aspect-based multimodal sentiment analysis[J]. IEEE Transactions on Multimedia, 2023, 26: 4028-4038.
[17] YU J F, CHEN K, XIA R. Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2023, 14(3): 1966-1978.
[18] YANG J, XU M Y, XIAO Y L, et al. AMIFN: aspect-guided multi-view interactions and fusion network for multimodal aspect-based sentiment analysis[J]. Neurocomputing, 2024, 573: 127222.
[19] WANG L, PENG J J, ZHENG C Z, et al. A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning[J]. Information Processing & Management, 2024, 61(3): 103675.
[20] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
[21] ZHOU P, SHI W, TIAN J, et al. Attention-based bidirectional long short-term memory networks for relation classification[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2016: 207-212.
[22] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[23] ZHOU B L, LAPEDRIZA A, KHOSLA A, et al. Places: a 10 million image database for scene recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1452-1464.
[24] WU Q, SHEN C H, LIU L Q, et al. What value do explicit high level concepts have in vision to language problems?[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 203-212.
[25] CAI Y T, CAI H Y, WAN X J. Multi-modal sarcasm detection in Twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 2506-2515.
[26] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[27] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[28] ZHAN F N, YU Y C, WU R L, et al. Multimodal image synthesis and editing: a survey and taxonomy[J]. arXiv:2112.13592, 2021.
[29] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017: 5998-6008.
[30] HUANG L, MA D, LI S, et al. Text level graph neural network for text classification[J]. arXiv:1910.02356, 2019.
[31] WANG Y Z, QIAN S S, HU J, et al. Fake news detection via knowledge-driven multimodal graph convolutional networks[C]//Proceedings of the 2020 International Conference on Multimedia Retrieval. New York: ACM, 2020: 540-547.
[32] VELICKOVIC P, CUCURULL G, CASANOVA A, et al. Graph attention networks[J]. arXiv:1710.10903, 2017.
[33] ZHU T, LI L D, YANG J F, et al. Multimodal emotion classification with multi-level semantic reasoning network[J]. IEEE Transactions on Multimedia, 2022, 25: 6868-6880.
[34] NIU T, ZHU S A, PANG L, et al. Sentiment analysis on multi-view social data[C]//Proceedings of the 22nd International Conference on MultiMedia Modeling. Cham: Springer, 2016: 15-27.
[35] YANG X C, FENG S, WANG D L, et al. Image-text multimodal emotion classification via multi-view attentional network[J]. IEEE Transactions on Multimedia, 2020, 23: 4014-4026.
[36] XU N, MAO W J. MultiSentiNet: a deep semantic network for multimodal sentiment analysis[C]//Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. New York: ACM, 2017: 2399-2402.
[37] KIM Y. Convolutional neural networks for sentence classification[J]. arXiv:1408.5882, 2014.
[38] XU N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network[C]//Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics. Piscataway: IEEE, 2017: 152-154.
[39] XU N, MAO W J, CHEN G D. A co-memory network for multimodal sentiment analysis[C]//Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM, 2018: 929-932.
[40] LI Z, XU B, ZHU C H, et al. CLMLF: a contrastive learning and multi-layer fusion method for multimodal sentiment detection[J]. arXiv:2204.05515, 2022.
[41] VAN DER MAATEN L, HINTON G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9: 2579-2605.