Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (2): 1-18. DOI: 10.3778/j.issn.1002-8331.2305-0439
郭续,买日旦·吾守尔,古兰拜尔·吐尔洪
GUO Xu, Mairidan Wushouer, Gulanbaier Tuerhong
Online:
2024-01-15
Published:
2024-01-15
Abstract: Sentiment analysis is an emerging technology that aims to explore people's attitudes toward entities, and it can be applied in many fields and scenarios, such as product review analysis, public opinion analysis, mental health analysis, and risk assessment. Traditional sentiment analysis models focus mainly on textual content, yet certain forms of expression, such as sarcasm and hyperbole, are difficult to detect from text alone. As technology advances, people now express their opinions and feelings through multiple channels, including audio, images, and video, so sentiment analysis is shifting toward multimodality, which brings new opportunities to the field. In addition to textual information, multimodal sentiment analysis incorporates rich visual and acoustic information, and fusion-based analysis allows the implied sentiment polarity (positive, neutral, or negative) to be inferred more accurately. The main challenge in multimodal sentiment analysis is the integration of cross-modal sentiment information. This survey therefore focuses on the frameworks and characteristics of different fusion methods, reviews the fusion algorithms that have become popular in recent years, and discusses multimodal sentiment analysis in few-shot scenarios. It also covers the current state of development, commonly used datasets, feature extraction algorithms, application areas, and remaining challenges. The survey is intended to help researchers understand the state of research in multimodal sentiment analysis and to inspire the development of more effective models.
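To make the fusion idea in the abstract concrete, the sketch below illustrates the simplest feature-level (early) fusion strategy: per-modality feature vectors are concatenated and fed to a joint classifier that predicts three-way polarity. This is a minimal illustrative assumption, not the approach of any specific model reviewed in the survey; the feature dimensions and random inputs are placeholders, and a real system would obtain the input vectors from encoders such as BERT (text), openSMILE or librosa (audio), and OpenFace (vision).

# Illustrative sketch only: a minimal early-fusion (feature-concatenation)
# classifier for three-way sentiment polarity (negative / neutral / positive).
# All dimensions below are hypothetical placeholders for real encoder outputs.
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35,
                 hidden_dim=128, num_classes=3):
        super().__init__()
        # Concatenate the three per-modality feature vectors, then classify jointly.
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.net(fused)

# Toy usage: random tensors stand in for text / audio / visual features of 4 samples.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
print(logits.argmax(dim=-1))  # predicted polarity index for each sample

By contrast, decision-level (late) fusion trains one classifier per modality and combines their outputs, while the attention-, tensor-, and Transformer-based methods discussed in the survey replace the plain concatenation with richer cross-modal interactions.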
郭续, 买日旦·吾守尔, 古兰拜尔·吐尔洪. 基于多模态融合的情感分析算法研究综述[J]. 计算机工程与应用, 2024, 60(2): 1-18.
GUO Xu, Mairidan Wushouer, Gulanbaier Tuerhong. Survey of Sentiment Analysis Algorithms Based on Multimodal Fusion[J]. Computer Engineering and Applications, 2024, 60(2): 1-18.