Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (2): 1-18. DOI: 10.3778/j.issn.1002-8331.2305-0439
郭续,买日旦·吾守尔,古兰拜尔·吐尔洪
GUO Xu, Mairidan Wushouer, Gulanbaier Tuerhong
Online:
2024-01-15
Published:
2024-01-15
Abstract: Sentiment analysis is an emerging technology that aims to explore people's attitudes toward entities, and it can be applied in many fields and scenarios, such as product review analysis, public opinion analysis, mental health analysis, and risk assessment. Traditional sentiment analysis models focus mainly on textual content, yet certain forms of expression, such as sarcasm and hyperbole, are difficult to detect from text alone. As technology advances, people now express their opinions and feelings through multiple channels, including audio, images, and video, so sentiment analysis is shifting toward multimodality, which brings new opportunities to the field. In addition to textual information, multimodal sentiment analysis incorporates rich visual and acoustic information, and fusion-based analysis allows the implied sentiment polarity (positive, neutral, or negative) to be inferred more accurately. The main challenge in multimodal sentiment analysis is the integration of cross-modal sentiment information. This survey therefore focuses on the frameworks and characteristics of different fusion methods, reviews the fusion algorithms that have become popular in recent years, and discusses multimodal sentiment analysis in few-shot scenarios. It also covers the current state of development, commonly used datasets, feature extraction algorithms, application areas, and remaining challenges. The survey is intended to help researchers understand the state of research in multimodal sentiment analysis and to inspire the development of more effective models.
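To make the fusion idea in the abstract concrete, the sketch below illustrates the simplest feature-level (early) fusion strategy: per-modality feature vectors are concatenated and fed to a joint classifier that predicts three-way polarity. This is a minimal illustrative assumption, not the approach of any specific model reviewed in the survey; the feature dimensions and random inputs are placeholders, and a real system would obtain the input vectors from encoders such as BERT (text), openSMILE or librosa (audio), and OpenFace (vision).

# Illustrative sketch only: a minimal early-fusion (feature-concatenation)
# classifier for three-way sentiment polarity (negative / neutral / positive).
# All dimensions below are hypothetical placeholders for real encoder outputs.
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35,
                 hidden_dim=128, num_classes=3):
        super().__init__()
        # Concatenate the three per-modality feature vectors, then classify jointly.
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.net(fused)

# Toy usage: random tensors stand in for text / audio / visual features of 4 samples.
model = EarlyFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
print(logits.argmax(dim=-1))  # predicted polarity index for each sample

By contrast, decision-level (late) fusion trains one classifier per modality and combines their outputs, while the attention-, tensor-, and Transformer-based methods discussed in the survey replace the plain concatenation with richer cross-modal interactions.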
郭续, 买日旦·吾守尔, 古兰拜尔·吐尔洪. 基于多模态融合的情感分析算法研究综述[J]. 计算机工程与应用, 2024, 60(2): 1-18.
GUO Xu, Mairidan Wushouer, Gulanbaier Tuerhong. Survey of Sentiment Analysis Algorithms Based on Multimodal Fusion[J]. Computer Engineering and Applications, 2024, 60(2): 1-18.