[1] CHEEMA G S, HAKIMOV S, MÜLLER-BUDACK E, et al. A fair and comprehensive comparison of multimodal tweet sentiment analysis methods[C]//Proceedings of the Workshop on Multi-Modal Pre-Training for Multimedia Understanding. New York: ACM, 2021: 37-45.
[2] JIANG L, YU M, ZHOU M, et al. Target-dependent Twitter sentiment classification[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011: 151-160.
[3] XU N, MAO W J, CHEN G D. Multi-interactive memory network for aspect-based multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 371-378.
[4] YU J F, JIANG J, XIA R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 429-439.
[5] YU J F, JIANG J. Adapting BERT for target-oriented multimodal sentiment classification[C]//Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019: 5408-5414.
[6] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 6000-6010.
[7] KHAN Z, FU Y. Exploiting BERT for multimodal target sentiment classification through input space translation[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 3034-3042.
[8] LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 19730-19742.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 8748-8763.
[10] JIA C, YANG Y, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 4904-4916.
[11] LI J, SELVARAJU R, GOTMARE A, et al. Align before fuse: vision and language representation learning with momentum distillation[C]//Advances in Neural Information Processing Systems, 2021: 9694-9705.
[12] WANG P, YANG A, MEN R, et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[J]. arXiv:2202.03052, 2022.
[13] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: a visual language model for few-shot learning[C]//Advances in Neural Information Processing Systems, 2022: 23716-23736.
[14] WANG W H, BAO H B, DONG L, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks[J]. arXiv:2301.05781, 2023.
[15] LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//Proceedings of the International Conference on Machine Learning, 2022: 12888-12900.
[16] YANG R, WANG S, SUN Y Z, et al. Multimodal fusion remote sensing image audio retrieval[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 6220-6235.
[17] LIU X R, WANG Z J, WANG L. Multimodal fusion for image and text classification with feature selection and dimension reduction[J]. Journal of Physics: Conference Series, 2021, 1871(1): 012064.
[18] HUANG F R, ZHANG X M, ZHAO Z H, et al. Image text sentiment analysis via deep multimodal attentive fusion[J]. Knowledge-Based Systems, 2019, 167: 26-37.
[19] ZHU Q, YEH M C, CHENG K T. Multimodal fusion using learned text concepts for image categorization[C]//Proceedings of the 14th ACM International Conference on Multimedia. New York: ACM, 2006: 211-220.
[20] WILKINSON T, BRUN A. Semantic and verbatim word spotting using deep neural networks[C]//Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition. Piscataway: IEEE, 2016: 307-312.
[21] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[22] PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in PyTorch[C]//Proceedings of the 31st Conference on Neural Information Processing Systems, 2017.
[23] LOSHCHILOV I, HUTTER F. Fixing weight decay regularization in Adam[J]. arXiv:1711.05101, 2017.
[24] WANG Y Q, HUANG M L, ZHU X Y, et al. Attention-based LSTM for aspect-level sentiment classification[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2016: 606-615.
[25] CHEN P, SUN Z Q, BING L D, et al. Recurrent attention network on memory for aspect sentiment analysis[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 452-461.
[26] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[J]. arXiv:1810.04805, 2018.
[27] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[J]. arXiv:1907.11692, 2019.
[28] AN J Y, WAN ZAINON W M N, et al. Improving targeted multimodal sentiment classification with semantic description of images[J]. Computers, Materials & Continua, 2023, 75(3): 5801-5815.
[29] YU J F, CHEN K, XIA R. Hierarchical interactive multimodal Transformer for aspect-based multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2023, 14(3): 1966-1978.