计算机工程与应用 ›› 2025, Vol. 61 ›› Issue (24): 176-186. DOI: 10.3778/j.issn.1002-8331.2409-0391

• 模式识别与人工智能 •

面向多源信息聚类与私有特征学习的情感分析

钟婷1,冯广1+,林健忠1,杨燕茹2,周垣桦1,郑润庭2,刘天翔2   

  1. 广东工业大学 自动化学院,广州 510006
  2. 广东工业大学 计算机学院,广州 510006
  • 出版日期:2025-12-15 发布日期:2025-12-15

Multimodal Sentiment Analysis Focusing on Multisource Information Clustering and Private Feature Learning

ZHONG Ting1, FENG Guang1+, LIN Jianzhong1, YANG Yanru2, ZHOU Yuanhua1, ZHENG Runting2, LIU Tianxiang2   

  1. School of Automation, Guangdong University of Technology, Guangzhou 510006, China
  2. School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China
  • Online:2025-12-15 Published:2025-12-15

摘要: 针对目前情感分析研究大多侧重于文本模态,而对音频和视频模态的处理相对简单、未能充分挖掘其在增强情感信息方面的潜力,且跨模态特征融合中存在信息冗余的问题,提出了一种面向多源信息聚类与私有特征学习的情感分析模型。该模型引入隐性聚类的思想,通过跨模态注意力机制优化音视频特征与文本特征的互补能力,并将不同模态的特征划分为若干类簇,以减少无关信息对融合过程的干扰。进一步地,特征一致性增强机制使用马氏距离度量对音视频模态特征进行增强和过滤,从而提升情感信息密度。与此同时,自适应权重调控机制根据类簇的语义一致性调节音视频模态的融合权重比例,并结合文本模态消除模态间的语义歧义。此外,模型引入自监督学习策略,进一步增强单模态的情感预测能力,帮助模型学习各模态的独特特性。实验结果表明,该模型在CMU-MOSEI和CMU-MOSI数据集上的情感分类表现显著提升,验证了其在多模态信息融合和冗余信息抑制方面的有效性。
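As a rough illustration of the Mahalanobis-distance-based feature consistency step described above, the sketch below filters audio or video frame features by their Mahalanobis distance to a reference feature distribution. Taking the text features as that reference, the keep ratio, and the function name are illustrative assumptions, not the paper's actual design.

```python
# Illustrative NumPy sketch of Mahalanobis-distance feature filtering,
# written from the abstract's description; the reference distribution
# (text features) and keep_ratio are assumptions, not the published method.
import numpy as np


def mahalanobis_filter(frame_feats, ref_feats, keep_ratio=0.7, eps=1e-6):
    """Keep the frames whose features lie closest (in Mahalanobis distance)
    to the distribution of the reference features.

    frame_feats: (T, D) per-frame audio or video features
    ref_feats:   (N, D) reference features defining the target distribution
    keep_ratio:  fraction of frames to keep (illustrative hyperparameter)
    """
    mu = ref_feats.mean(axis=0)                                   # (D,)
    cov = np.cov(ref_feats, rowvar=False) + eps * np.eye(ref_feats.shape[1])
    cov_inv = np.linalg.inv(cov)
    diff = frame_feats - mu                                       # (T, D)
    # Squared Mahalanobis distance of every frame to the reference distribution.
    d2 = np.einsum("td,de,te->t", diff, cov_inv, diff)            # (T,)
    # Keep the most consistent frames; the rest are treated as redundant noise.
    k = max(1, int(keep_ratio * len(frame_feats)))
    keep_idx = np.argsort(d2)[:k]
    return frame_feats[np.sort(keep_idx)], d2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_feats = rng.normal(size=(40, 16))    # stand-in for text features
    audio_feats = rng.normal(size=(60, 16))   # stand-in for audio frames
    kept, dist = mahalanobis_filter(audio_feats, text_feats)
    print(kept.shape, dist.shape)             # (42, 16) (60,)
```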

关键词: 多模态情感分析, 注意力机制, 隐性聚类, 马氏距离, 自监督学习

Abstract: Current sentiment analysis models focus largely on the text modality, while audio and video are handled in a comparatively simple way that fails to fully exploit their potential for enriching emotional information; cross-modal feature fusion also suffers from information redundancy. To address these issues, this paper proposes a sentiment analysis model oriented toward multisource information clustering and private feature learning. Borrowing the idea of latent clustering, the model uses a cross-modal attention mechanism to strengthen the complementarity between audio-visual and text features and partitions the features of each modality into several clusters, reducing the interference of irrelevant information during fusion. A feature consistency enhancement mechanism then applies the Mahalanobis distance to enhance and filter the audio and video features, increasing the density of emotional information. In parallel, an adaptive weight adjustment mechanism tunes the fusion weights of the audio and video modalities according to the semantic consistency of the clusters and combines them with the text modality to resolve semantic ambiguity across modalities. The model further incorporates a self-supervised learning strategy to strengthen unimodal sentiment prediction and to help the model learn the private characteristics of each modality. Experimental results on the CMU-MOSEI and CMU-MOSI datasets show significant improvements in sentiment classification, validating the effectiveness of the model in multimodal information fusion and redundancy suppression.
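To make the fusion pipeline described above concrete, the following PyTorch sketch soft-assigns audio and video features to learnable cluster centroids, lets the text tokens attend over the cluster summaries, and weights each cluster by its consistency with the pooled text representation before fusing. Module names, dimensions, the number of clusters, and the gating form are assumptions for illustration only, not the authors' implementation.

```python
# Minimal PyTorch sketch of cluster-based cross-modal fusion with
# consistency-driven adaptive weights; all hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClusteredCrossModalFusion(nn.Module):
    """Soft-clusters audio/video features, attends to the cluster summaries
    from the text modality, and fuses them with consistency-based weights."""

    def __init__(self, dim=128, num_clusters=4, num_heads=4):
        super().__init__()
        # Learnable centroids give an implicit (latent) clustering: every
        # audio/video time step is softly assigned to each centroid.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)           # per-cluster consistency score
        self.out = nn.Linear(2 * dim, dim)

    def cluster(self, feats):
        # feats: (B, T, D) sequence features of one modality.
        assign = F.softmax(feats @ self.centroids.t(), dim=-1)           # (B, T, K)
        summaries = assign.transpose(1, 2) @ feats                        # (B, K, D)
        counts = assign.sum(dim=1).unsqueeze(-1)                          # (B, K, 1)
        return summaries / (counts + 1e-6)

    def forward(self, text, audio, video):
        # text: (B, Tt, D); audio: (B, Ta, D); video: (B, Tv, D)
        av_clusters = torch.cat([self.cluster(audio), self.cluster(video)], dim=1)  # (B, 2K, D)
        # Text tokens attend over audio/video cluster summaries, so only
        # clusters relevant to the text contribute to the fused feature.
        attended, _ = self.attn(text, av_clusters, av_clusters)          # (B, Tt, D)
        # Adaptive weighting: clusters more consistent with the pooled text
        # vector receive larger fusion weights.
        text_vec = text.mean(dim=1)                                       # (B, D)
        consistency = torch.sigmoid(self.gate(av_clusters * text_vec.unsqueeze(1)))  # (B, 2K, 1)
        av_context = (consistency * av_clusters).sum(dim=1) / (consistency.sum(dim=1) + 1e-6)
        return self.out(torch.cat([attended.mean(dim=1), av_context], dim=-1))       # (B, D)


if __name__ == "__main__":
    model = ClusteredCrossModalFusion()
    text = torch.randn(2, 20, 128)    # stand-in for text token features
    audio = torch.randn(2, 50, 128)
    video = torch.randn(2, 30, 128)
    print(model(text, audio, video).shape)    # torch.Size([2, 128])
```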

Key words: multimodal sentiment analysis, attention mechanism, latent clustering, Mahalanobis distance, self-supervised learning