Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (3): 223-233. DOI: 10.3778/j.issn.1002-8331.2309-0470

• Pattern Recognition and Artificial Intelligence •

Image-Text Sentiment Contrastive Learning Method Based on Deep Attention and Two-Stage Fusion

YU Bengong, SHI Zhongyu   

  1. School of Management, Hefei University of Technology, Hefei 230009, China
  2. Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
  • Online: 2025-02-01   Published: 2025-01-24

Abstract: Image-text data has gradually become the mainstream carrier of online public opinion. Image-text sentiment analysis exploits the complementarity of multimodal information to improve sentiment analysis, and has great application potential in fields such as human-machine dialogue and public opinion monitoring. Previous studies mostly concatenate image and text features before applying attention for fusion, so the modalities interact insufficiently and the fused vector carries a large amount of noise. An image-text sentiment contrastive learning method based on deep attention and two-stage fusion is proposed. A deep cross-modal attention network performs modal interaction to fully capture the hidden information of each modality. The designed cross-modal gated fusion module uses a gating mechanism and attention to fuse features in two stages, dynamically adjusting feature weights to reduce data noise. The model is also jointly trained on contrastive learning and sentiment classification tasks, which helps capture sentiment-related features shared across modalities and improves robustness. In experiments on the MVSA-Single, MVSA-Multiple, and HFM datasets, the accuracy and F1 score improve by an average of 1.04 and 0.96 percentage points, respectively, over the best baseline model.
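To make the pipeline the abstract describes concrete, the sketch below gives one plausible PyTorch realization of its three components: stacked (deep) cross-modal attention, a two-stage gated fusion module, and joint contrastive/classification training. All module names, the depth of 4, the temperature, and the 0.5 loss weight are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the described architecture. Names, dimensions, and
# hyperparameters are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttentionLayer(nn.Module):
    """One interaction layer: each modality attends to the other."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)

    def forward(self, text, image):
        t, _ = self.t2i(text, image, image)   # text queries image
        i, _ = self.i2t(image, text, text)    # image queries text
        return self.norm_t(text + t), self.norm_i(image + i)


class GatedTwoStageFusion(nn.Module):
    """Stage 1: a gate dynamically weights each modality's features;
    stage 2: attention pooling over the gated sequence suppresses noise."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.attn_pool = nn.Linear(dim, 1)

    def forward(self, text, image):
        ctx = torch.cat([text.mean(1), image.mean(1)], dim=-1)  # (B, 2D)
        g = self.gate(ctx).unsqueeze(1)                         # (B, 1, D)
        fused = torch.cat([g * text, (1 - g) * image], dim=1)   # (B, Lt+Li, D)
        w = F.softmax(self.attn_pool(fused), dim=1)             # (B, Lt+Li, 1)
        return (w * fused).sum(dim=1)                           # (B, D)


class ImageTextSentimentModel(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 4, num_classes: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [CrossModalAttentionLayer(dim) for _ in range(depth)])
        self.fusion = GatedTwoStageFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, image):
        # text: (B, Lt, D) token features, e.g. from BERT;
        # image: (B, Li, D) patch/region features, e.g. from a ViT.
        for layer in self.layers:             # deep cross-modal interaction
            text, image = layer(text, image)
        fused = self.fusion(text, image)
        return self.classifier(fused), text.mean(1), image.mean(1)


def joint_loss(logits, labels, t_vec, v_vec, tau=0.07, alpha=0.5):
    """Sentiment cross-entropy plus a symmetric InfoNCE image-text term;
    matched pairs in the batch are positives, all others negatives."""
    ce = F.cross_entropy(logits, labels)
    t = F.normalize(t_vec, dim=-1)
    v = F.normalize(v_vec, dim=-1)
    sim = t @ v.T / tau                       # (B, B) similarity matrix
    tgt = torch.arange(sim.size(0), device=sim.device)
    nce = (F.cross_entropy(sim, tgt) + F.cross_entropy(sim.T, tgt)) / 2
    return ce + alpha * nce
```

As a smoke test, `logits, t_vec, v_vec = ImageTextSentimentModel()(torch.randn(8, 32, 768), torch.randn(8, 49, 768))` followed by `joint_loss(logits, torch.randint(0, 3, (8,)), t_vec, v_vec)` runs the whole pipeline on random features; how the paper actually balances the two loss terms is not stated in the abstract.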

Key words: image-text sentiment analysis, deep attention, two-stage fusion, gated attention, contrastive learning
