Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (14): 214-222. DOI: 10.3778/j.issn.1002-8331.2404-0294

• Pattern Recognition and Artificial Intelligence •


Multimodal Sentiment Analysis Based on Top-Down Mask Generation and Cascading Transformer

FENG Cheng, YANG Hai, WANG Shuxian, LI Xue   

  1. College of Information Science and Electrical Engineering, Shandong Jiaotong University, Jinan 250357, China
  • Online: 2025-07-15   Published: 2025-07-15


Abstract: Existing sentiment analysis models struggle to capture the information correlation between different modalities and suffer from information redundancy during cross-modal feature fusion. To address these problems, this paper proposes a multimodal sentiment analysis model based on top-down mask generation and a cascading Transformer. First, a mask generation module produces masks from bimodal features and applies them to the remaining modality, mining the interrelationships and complementarity between modalities and yielding richer modality feature representations. Second, a three-layer stacked Transformer structure fuses the multimodal features at multiple levels, producing three sub-modal fusion vectors; these are then merged effectively to deepen the fusion while avoiding redundancy, yielding the final multimodal feature fusion vector used for sentiment analysis. Experimental results show that on the CMU-MOSI and CMU-MOSEI datasets the model outperforms other state-of-the-art models, achieving MAE values of 0.675 and 0.508 and binary classification accuracies of 85.6% and 85.1%, respectively.
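To make the masking idea concrete, the gating step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the additive bimodal fusion, the learned projection `w`, and the sigmoid squashing are all assumptions standing in for whatever the actual mask generation module computes; it only shows the shape of the operation, in which two modalities jointly produce a soft mask that gates the third.

```python
import numpy as np

def sigmoid(x):
    # Squash scores into (0, 1) so they can act as a soft mask.
    return 1.0 / (1.0 + np.exp(-x))

def bimodal_mask(feat_a, feat_b, w):
    # Combine two modality features (additively, for illustration),
    # project them, and squash to obtain a per-dimension soft mask.
    return sigmoid((feat_a + feat_b) @ w)

def apply_mask(mask, feat_c):
    # Element-wise gating of the third modality by the bimodal mask.
    return mask * feat_c

rng = np.random.default_rng(0)
d = 8  # toy feature dimension
text, audio, video = (rng.standard_normal(d) for _ in range(3))
w = rng.standard_normal((d, d))  # stand-in for a learned projection

# e.g. a mask generated from text + audio gates the video features
masked_video = apply_mask(bimodal_mask(text, audio, w), video)
print(masked_video.shape)  # (8,)
```

Because the mask values lie strictly in (0, 1), the gated features are always attenuated copies of the original modality; in the full model this gating would be applied symmetrically for each bimodal pairing before the cascading Transformer fusion.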

Key words: multimodal sentiment analysis, modal fusion, mask generation