[1] 赵杨, 张雪, 王玮航, 等. 基于多模态情感分析的图书馆智能服务用户情感体验度量[J]. 情报科学, 2023, 41(9): 155-163.
ZHAO Y, ZHANG X, WANG W H, et al. Emotional experience measurement of library intelligent service users based on multi-modal emotional analysis[J]. Information Science, 2023, 41(9): 155-163.
[2] ZHOU J, JIN P, ZHAO J. Sentiment analysis of online reviews with a hierarchical attention network[C]//Proceedings of the International Conference on Software Engineering and Knowledge Engineering, 2020: 100-110.
[3] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[4] NOJAVANASGHARI B, GOPINATH D, KOUSHIK J, et al. Deep multimodal fusion for persuasiveness prediction[C]//Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016: 284-288.
[5] ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[J]. arXiv:1707.07250, 2017.
[6] GKOUMAS D, LI Q, LIOMA C, et al. What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis[J]. Information Fusion, 2021, 66: 184-197.
[7] TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019: 6558-6569.
[8] RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 2359-2369.
[9] ZHANG Q, SHI L, LIU P, et al. ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis[J]. Applied Intelligence, 2023, 53(12): 16332-16345.
[10] LIANG P P, LIU Z, ZADEH A, et al. Multimodal language analysis with recurrent multistage fusion[J]. arXiv:1808.03920, 2018.
[11] XIAO X, PU Y, ZHAO Z, et al. Image-text sentiment analysis via context guided adaptive fine-tuning transformer[J]. Neural Processing Letters, 2023, 55(3): 2103-2125.
[12] 陈杰, 马静, 李晓峰, 等. 基于DR-Transformer模型的多模态情感识别研究[J]. 情报科学, 2022, 40(3): 117-125.
CHEN J, MA J, LI X F, et al. Multi-modal emotion recognition based on DR-Transformer model[J]. Information Science, 2022, 40(3): 117-125.
[13] BASU P, TIWARI S, MOHANTY J, et al. Multimodal sentiment analysis of #MeToo tweets using focal loss (grand challenge)[C]//Proceedings of the 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM), 2020.
[14] HUANG F, WEI K, WENG J, et al. Attention-based modality-gated networks for image-text sentiment analysis[J]. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2020, 16(3): 1-19.
[15] YANG X, FENG S, WANG D, et al. Image-text multimodal emotion classification via multi-view attentional network[J]. IEEE Transactions on Multimedia, 2021, 23: 4014-4026.
[16] ZHOU B, LAPEDRIZA A, KHOSLA A, et al. Places: a 10 million image database for scene recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1452-1464.
[17] YANG X, FENG S, ZHANG Y, et al. Multimodal sentiment detection based on multi-channel graph neural networks[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021: 328-339.
[18] YANG K, XU H, GAO K. CM-BERT: cross-modal BERT for text-audio sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia, 2020: 521-528.
[19] 包广斌, 李港乐, 王国雄. 面向多模态情感分析的双模态交互注意力[J]. 计算机科学与探索, 2022, 16(4): 909-916.
BAO G B, LI G L, WANG G X. Bimodal interactive attention for multimodal sentiment analysis[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(4): 909-916.
[20] LIAO W, ZENG B, LIU J, et al. Image-text interaction graph neural network for image-text sentiment analysis[J]. Applied Intelligence, 2022, 52(10): 11184-11198.
[21] HUANG F, ZHANG X, ZHAO Z, et al. Image-text sentiment analysis via deep multimodal attentive fusion[J]. Knowledge-Based Systems, 2019, 167: 26-37.
[22] ZHU T, LI L, YANG J, et al. Multimodal sentiment analysis with image-text interaction network[J]. IEEE Transactions on Multimedia, 2023, 25: 3375-3385.
[23] SUN T, WANG S, ZHONG S. Multi-granularity feature attention fusion network for image-text sentiment analysis[C]//Proceedings of the 39th Computer Graphics International Conference (CGI 2022). Cham: Springer Nature Switzerland, 2022: 3-14.
[24] ZHAO Z, ZHU H, XUE Z, et al. An image-text consistency driven multimodal sentiment analysis approach for social media[J]. Information Processing & Management, 2019, 56(6): 102097.
[25] 缪裕青, 杨爽, 刘同来, 等. 基于跨模态门控机制和改进融合方法的多模态情感分析[J]. 计算机应用研究, 2023, 40(7): 2025-2030.
MIAO Y Q, YANG S, LIU T L, et al. Multimodal sentiment analysis based on cross-modal gating mechanism and improved fusion method[J]. Application Research of Computers, 2023, 40(7): 2025-2030.
[26] 刘青文, 买日旦·吾守尔, 古兰拜尔·吐尔洪. 双元双模态下二次门控融合的多模态情感分析[J]. 计算机工程与应用, 2024, 60(8): 165-172.
LIU Q W, MAIRIDAN W, GULANBAIER T. Bi-bi-modality with bi-gated fusion in multimodal sentiment analysis[J]. Computer Engineering and Applications, 2024, 60(8): 165-172.
[27] MAI S, ZENG Y, ZHENG S, et al. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2023, 14(3): 2276-2289.
[28] WANG H, LI X, REN Z, et al. Multimodal sentiment analysis representations learning via contrastive learning with condense attention fusion[J]. Sensors, 2023, 23(5): 2679.
[29] LI Z, XU B, ZHU C, et al. CLMLF: a contrastive learning and multi-layer fusion method for multimodal sentiment detection[J]. arXiv:2204.05515, 2022.
[30] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019: 4171-4186.
[31] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[32] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017.
[33] GONG Y, BOWMAN S R. Ruminating reader: reasoning with gated multi-hop attention[J]. arXiv:1704.07415, 2017.
[34] NIU Z, ZHONG G, YU H. A review on the attention mechanism of deep learning[J]. Neurocomputing, 2021, 452: 48-62.
[35] NIU T, ZHU S, PANG L, et al. Sentiment analysis on multi-view social data[C]//Proceedings of the 22nd International Conference on MultiMedia Modeling (MMM 2016), Miami, FL, USA, January 4-6, 2016: 15-27.
[36] CAI Y, CAI H, WAN X. Multi-modal sarcasm detection in twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2506-2515.
[37] LUO Y R, LI Z, et al. Word sense disambiguation in biomedical text based on Bi-LSTM[J]. Software Guide, 2019.
[38] HUANG L, MA D, LI S, et al. Text level graph neural network for text classification[J]. arXiv:1910.02356, 2019.
[39] CAI G, XIA B. Convolutional neural networks for multimedia sentiment analysis[C]//Proceedings of the 4th CCF Conference on Natural Language Processing and Chinese Computing (NLPCC 2015), Nanchang, China, October 9-13, 2015: 159-167.
[40] YU Y, LIN H, MENG J, et al. Visual and textual sentiment analysis of a microblog using deep convolutional neural networks[J]. Algorithms, 2016, 9(2): 41.
[41] XU N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network[C]//Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), 2017: 152-154.
[42] CHEEMA G S, HAKIMOV S, MÜLLER-BUDACK E, et al. A fair and comprehensive comparison of multimodal tweet sentiment analysis methods[C]//Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding, 2021: 37-45.
[43] 周婷, 杨长春. 基于多层注意力机制的图文双模态情感分析[J]. 计算机工程与设计, 2023, 44(6): 1853-1859.
ZHOU T, YANG C C. Image-text sentiment analysis based on multilevel attention mechanism[J]. Computer Engineering and Design, 2023, 44(6): 1853-1859.
[44] SCHIFANELLA R, DE JUAN P, TETREAULT J, et al. Detecting sarcasm in multimodal social platforms[C]//Proceedings of the 24th ACM International Conference on Multimedia, 2016: 1136-1145.
[45] XU N, ZENG Z, MAO W. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 3777-3786.