Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (20): 238-247. DOI: 10.3778/j.issn.1002-8331.2407-0236

• Pattern Recognition and Artificial Intelligence •

Multimodal Aspect Level Sentiment Analysis with Fine-Tuning of Image Captions

YANG Hang (杨航), XU Yunfeng (许云峰)

  1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050000, China
  2. Shijiazhuang Campus of Army Engineering University, Shijiazhuang 050003, China
  • Online: 2025-10-15   Published: 2025-10-15

Abstract: To address problems in previous studies of multimodal aspect-level sentiment analysis, such as data noise that is not handled effectively and insufficient fusion of multimodal features, this paper proposes a multimodal aspect-level emotion recognition network (MALERN) based on image captions. While the text and image modalities interact, the network introduces image captions extracted from the image dataset as supplementary information for the text data. To obtain more effective captions, MALERN adopts an unsupervised fine-tuning method for BLIP2; compared with using BLIP2 directly to generate captions, this method ensures that the extracted captions describe the image information more accurately. In addition, in the feature fusion stage, MALERN uses a multimodal feature fusion network based on self-attention and LSTM (MFNSL) to fuse the multimodal features. Compared with feature concatenation, MFNSL can effectively handle semantically unrelated information between image and text, thereby alleviating the introduction of noise to a certain extent. Experimental results show that on the public datasets Twitter2015 and Twitter2017, the accuracy and F1 score of MALERN reach 79.36% and 75.44%, and 73.18% and 71.30%, respectively, improvements of 1.22 and 1.76 percentage points and of 2.04 and 2.14 percentage points over the best baseline model. These results show that MALERN can make full use of the semantic information of multimodal data to improve the prediction of multimodal aspect-level sentiment analysis.
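A minimal sketch of the caption-extraction step described above, assuming the Hugging Face transformers implementation of BLIP-2; the unsupervised fine-tuning that MALERN applies before captioning is not shown, and the checkpoint name below is an illustrative assumption rather than the authors' setup.

# Sketch: generating an image caption with BLIP-2 (Hugging Face transformers).
# The checkpoint name is an assumed public one, not necessarily the authors'.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_image(path: str) -> str:
    # Load one tweet image and generate a short caption to use as extra text input.
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()

# Usage: captions = [caption_image(p) for p in image_paths]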

Key words: multimodal feature fusion, image captioning, BLIP2, RoBERTa, attention mechanism
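As a rough illustration of the MFNSL idea described in the abstract, the following is a minimal PyTorch sketch that applies self-attention across the concatenated text, caption, and image token sequences and then aggregates the result with an LSTM; the hidden size, head count, pooling, and classifier head are assumptions, since the abstract does not specify them.

import torch
import torch.nn as nn

class SelfAttentionLSTMFusion(nn.Module):
    # Illustrative fusion block in the spirit of MFNSL: self-attention followed
    # by an LSTM over the fused sequence. Dimensions are assumed, not the paper's.
    def __init__(self, dim: int = 768, heads: int = 8, num_classes: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feats, caption_feats, image_feats):
        # Concatenate the three token sequences: (batch, L_text + L_cap + L_img, dim).
        fused = torch.cat([text_feats, caption_feats, image_feats], dim=1)
        # Self-attention lets tokens attend across modalities, which is where
        # semantically unrelated image/text information can be down-weighted.
        attended, _ = self.attn(fused, fused, fused)
        # The LSTM aggregates the attended sequence; mean-pool its outputs.
        seq_out, _ = self.lstm(attended)
        pooled = seq_out.mean(dim=1)
        return self.classifier(pooled)  # sentiment logits for one aspect

# Usage with RoBERTa-sized (768-dim) features, batch of 2:
# logits = SelfAttentionLSTMFusion()(torch.randn(2, 32, 768),
#                                    torch.randn(2, 16, 768),
#                                    torch.randn(2, 49, 768))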