Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (18): 208-216. DOI: 10.3778/j.issn.1002-8331.2306-0364

• Pattern Recognition and Artificial Intelligence •

Multimodal Sentiment Analysis Based on Cross-Modal Joint-Encoding

SUN Bin, JIANG Tao, JIA Li, CUI Yiming   

  1. Key Laboratory of Language and Cultural Computing of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
    2. School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
  • Online: 2024-09-15  Published: 2024-09-13

Abstract: Improving the effectiveness of multimodal fusion features is one of the central problems in multimodal sentiment analysis. Most previous studies obtain fused feature representations by designing complex fusion strategies. These methods tend to ignore the complex correlations between modalities, and the effectiveness of the fused features is further reduced by inconsistencies across modal information, which degrades model performance. To address these problems, this paper proposes a multimodal sentiment analysis model based on cross-modal joint-encoding. For feature extraction, the pre-trained BERT model and the Facet tool are used to extract text and visual features respectively, and one-dimensional convolutions map them to unimodal feature representations of the same dimension. For feature fusion, a cross-modal attention module produces joint features of the two modalities; the joint features are then used to adjust the weights of each unimodal feature, and the re-weighted unimodal features are concatenated to obtain the multimodal fusion feature, which is finally fed into a fully connected layer for sentiment recognition. Extensive experiments on the public CMU-MOSI dataset show that the model outperforms most existing state-of-the-art multimodal sentiment analysis methods and effectively improves sentiment analysis performance.
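As a rough illustration of the feature-extraction step described in the abstract, the sketch below (PyTorch and Hugging Face transformers are assumed, since the paper's implementation framework is not stated here) extracts contextual text features with a pre-trained BERT model, takes precomputed visual descriptors such as those produced by Facet as input, and maps both to unimodal representations of the same dimension with one-dimensional convolutions. Identifiers such as UnimodalProjector, d_model and visual_dim are illustrative assumptions rather than names from the paper.

```python
# Illustrative sketch of the feature-extraction step (assumed PyTorch +
# Hugging Face transformers). Text features come from pre-trained BERT;
# visual features are assumed to be precomputed Facet-style descriptors.
# Conv1d layers project both streams to a shared dimension d_model.
import torch
import torch.nn as nn
from transformers import BertModel

class UnimodalProjector(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, visual_dim=35, d_model=128, kernel_size=1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Conv1d expects (batch, channels, seq_len), hence the transposes below.
        self.text_conv = nn.Conv1d(self.bert.config.hidden_size, d_model, kernel_size)
        self.visual_conv = nn.Conv1d(visual_dim, d_model, kernel_size)

    def forward(self, input_ids, attention_mask, visual_feats):
        # Contextual token embeddings from BERT: (batch, text_len, 768).
        text = self.bert(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        # One-dimensional convolutions map both modalities to d_model channels.
        text = self.text_conv(text.transpose(1, 2)).transpose(1, 2)              # (B, L_t, d_model)
        visual = self.visual_conv(visual_feats.transpose(1, 2)).transpose(1, 2)  # (B, L_v, d_model)
        return text, visual
```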

Key words: multimodal sentiment analysis, joint-encoding, cross-modal attention, multimodal fusion
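The fusion stage summarized in the abstract, in which cross-modal attention produces joint features that are then used to re-weight the unimodal features before concatenation and a fully connected output layer, could be sketched as follows. This is a minimal sketch under stated assumptions: the exact attention configuration, the sigmoid gating used to adjust the unimodal weights, and the mean-pooling over time are illustrative choices that the abstract does not specify.

```python
# Minimal sketch of the fusion stage (assumed PyTorch). The cross-modal
# attention, the gating that re-weights unimodal features with the joint
# features, and the pooling are illustrative choices, not the paper's exact design.
import torch
import torch.nn as nn

class CrossModalJointFusion(nn.Module):  # hypothetical name
    def __init__(self, d_model=128, n_heads=4, n_outputs=1):
        super().__init__()
        # Cross-modal attention: each modality attends to the other.
        self.text_to_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate_t = nn.Linear(2 * d_model, d_model)
        self.gate_v = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(2 * d_model, n_outputs)

    def forward(self, text, visual):
        # Joint features from cross-modal attention (queries from one modality,
        # keys/values from the other), mean-pooled over the sequence dimension.
        joint_t, _ = self.text_to_vis(text, visual, visual)   # text attends to visual
        joint_v, _ = self.vis_to_text(visual, text, text)     # visual attends to text
        joint = torch.cat([joint_t.mean(1), joint_v.mean(1)], dim=-1)  # (B, 2*d_model)

        # Use the joint features to adjust the weight of each unimodal feature.
        t_pooled, v_pooled = text.mean(1), visual.mean(1)
        t_weighted = torch.sigmoid(self.gate_t(joint)) * t_pooled
        v_weighted = torch.sigmoid(self.gate_v(joint)) * v_pooled

        # Concatenate and feed the fused representation to a fully connected layer.
        fused = torch.cat([t_weighted, v_weighted], dim=-1)
        return self.classifier(fused)  # sentiment prediction
```

On CMU-MOSI the target is usually a continuous sentiment score in [-3, 3], so a single regression output is used in this sketch; a classification head could be substituted for discrete labels.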