Image-Text Fusion Sentiment Analysis Method Based on Image Semantic Translation

doi:10.3778/j.issn.1002-8331.2203-0036

Abstract

Abstract: In multimodal sentiment analysis, an image can evoke different emotions depending on the context or on which part of it the viewer attends to. To address this problem of image semantic understanding, this paper proposes an image-text fusion sentiment analysis method based on image semantic translation (ImaText-IST). First, each image is fed into an image translation module that translates it into image captions. The module incorporates different emotional expressions into caption generation and produces one caption for each of three sentiment polarities: positive, neutral and negative. Then, emotional correlation analysis is performed between the three polarity-specific captions and the texts in the dataset, which makes the understanding of the image's sentiment more accurate. Finally, sentiment is predicted from the image semantic captions, the targets and the texts, using two alternative approaches: feature fusion and auxiliary sentences. Experimental results show that the auxiliary-sentence approach (Axu-ImaText-IST) better captures the sentiment of image-text pairs, and that its Accuracy and Macro-F1 on the social media datasets Twitter-15 and Twitter-17 are both higher than those of the baseline model.
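The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the three polarity-specific captions have already been generated, uses a toy token-overlap (Jaccard) score as a stand-in for the emotional correlation analysis, and assembles a BERT-style sentence-pair string as one plausible form of auxiliary-sentence input. The function names `jaccard`, `select_caption` and `build_auxiliary_input`, the example captions, and the `[CLS]`/`[SEP]` layout are all hypothetical.

```python
from typing import Dict


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for emotional correlation: token-set overlap (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_caption(text: str, captions: Dict[str, str]) -> str:
    """Return the polarity whose caption is most similar to the text."""
    return max(captions, key=lambda polarity: jaccard(text, captions[polarity]))


def build_auxiliary_input(text: str, target: str, caption: str) -> str:
    """Join text, target and caption into one sequence (hypothetical layout)."""
    return f"[CLS] {text} [SEP] {target} [SEP] {caption} [SEP]"


# Hypothetical captions, one per sentiment polarity, for a single image.
captions = {
    "positive": "a happy dog playing in a sunny park",
    "neutral": "a dog standing in a park",
    "negative": "a lonely dog lost in an empty park",
}
text = "so happy to see my dog playing again"
target = "dog"

best = select_caption(text, captions)
model_input = build_auxiliary_input(text, target, captions[best])
print(best)
print(model_input)
```

In a real system the overlap score would be replaced by a learned correlation measure, and the assembled string would be tokenized and passed to a pretrained language model for classification.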

Key words: image-text fusion, multimodal sentiment analysis, image caption, emotional correlation

HUANG Jian, WANG Ying. Image-Text Fusion Sentiment Analysis Method Based on Image Semantic Translation[J]. Computer Engineering and Applications, 2023, 59(11): 180-187.
