Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (2): 179-190. DOI: 10.3778/j.issn.1002-8331.2406-0108

• Pattern Recognition and Artificial Intelligence •


Implicit Sentiment Analysis for Chinese Texts Based on Multimodal Information Fusion

ZHANG Huanxiang (张换香), LI Mengyun (李梦云), ZHANG Jing (张景)

  1. College of Innovation and Entrepreneurship Education, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    2.College of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
    3.College of Digital and Intelligent Industry, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
    4.College of Science, Inner Mongolia University of Science and Technology, Baotou, Inner Mongolia 014010, China
  • Online: 2025-01-15    Published: 2025-01-15


Abstract: Implicit sentiment expressions contain no explicit sentiment words, which poses a challenge for implicit sentiment analysis. Drawing on external information is one effective way to address this problem. Unlike existing research, which relies mainly on a single textual modality, this paper proposes an implicit sentiment analysis method that fuses multimodal information, including speech and video. Acoustic features such as pitch and intensity are extracted from speech, and visual features such as facial expressions are captured from video, to aid the understanding of implicit sentiment. A BiLSTM network mines the contextual information within each unimodal sequence; a multi-head mutual attention mechanism then captures the speech and visual features related to the text, and iterative optimization reduces the low-order redundant information of the non-textual modalities. In addition, a text-centered cross-attention fusion module is designed to strengthen the implicit textual feature representation and handle inter-modal heterogeneity, improving the overall performance of implicit sentiment analysis. Experimental results on the CMU-MOSI, CMU-MOSEI, and MUMETA datasets show that the proposed model outperforms the baseline models. By making full use of external speech and visual knowledge, this multimodal strategy captures implicit sentiment expressions more comprehensively and accurately, effectively improving the accuracy of implicit sentiment analysis.
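
To make the pipeline described in the abstract concrete, the listing below sketches one way the text-centered design could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' released implementation: the module name TextCenteredFusion, the feature dimensions (768-dimensional text, 74-dimensional acoustic, 35-dimensional visual, as commonly used for CMU-MOSI/MOSEI features), the single-layer BiLSTM encoders, and the regression head are all hypothetical, and the iterative optimization step that removes low-order redundancy is omitted.

    # Minimal, assumption-laden sketch of text-centered multimodal fusion (not the paper's code).
    import torch
    import torch.nn as nn

    class TextCenteredFusion(nn.Module):
        def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=128, n_heads=4):
            super().__init__()
            # BiLSTM encoders mine intra-modal context within each unimodal sequence.
            self.text_rnn = nn.LSTM(d_text, d_model // 2, batch_first=True, bidirectional=True)
            self.audio_rnn = nn.LSTM(d_audio, d_model // 2, batch_first=True, bidirectional=True)
            self.visual_rnn = nn.LSTM(d_visual, d_model // 2, batch_first=True, bidirectional=True)
            # Multi-head cross-attention: text queries attend to audio/visual keys and values,
            # so only text-relevant non-verbal information is retained.
            self.text_audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.text_visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Hypothetical regression head producing a sentiment intensity score.
            self.classifier = nn.Sequential(
                nn.Linear(3 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
            )

        def forward(self, text, audio, visual):
            t, _ = self.text_rnn(text)      # (B, Lt, d_model)
            a, _ = self.audio_rnn(audio)    # (B, La, d_model)
            v, _ = self.visual_rnn(visual)  # (B, Lv, d_model)
            # Text-centered cross attention: the text sequence is the query in both branches.
            ta, _ = self.text_audio_attn(query=t, key=a, value=a)
            tv, _ = self.text_visual_attn(query=t, key=v, value=v)
            # Pool over time and fuse the text representation with the text-guided
            # audio and visual representations.
            fused = torch.cat([t.mean(1), ta.mean(1), tv.mean(1)], dim=-1)
            return self.classifier(fused)

    if __name__ == "__main__":
        model = TextCenteredFusion()
        out = model(torch.randn(2, 20, 768), torch.randn(2, 50, 74), torch.randn(2, 50, 35))
        print(out.shape)  # torch.Size([2, 1])

Keeping the text sequence as the query in every attention branch is what makes the fusion "text-centered": the non-textual modalities only contribute information that aligns with the implicit textual expression, which is the design choice the abstract attributes to the fusion module.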

Key words: implicit sentiment analysis, deep neural networks, multimodal, attention mechanism, feature fusion