Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (13): 124-135. DOI: 10.3778/j.issn.1002-8331.2302-0238

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Transformer Combination Model for Sentiment Analysis

WANG Liang, WANG Yi, WANG Jun   

  1. School of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, China
  2. Liaoning Key Laboratory of Intelligent Technology for Chemical Process Industry, Shenyang 110142, China
  • Online: 2024-07-01  Published: 2024-07-01

Abstract: Transformer-based end-to-end combination deep learning models are the mainstream approach to multimodal sentiment analysis. In related work, such models suffer from three problems: insufficient sentiment feature extraction from low-resource modalities; differences in feature scale across non-aligned modalities, which cause key feature information to be lost during alignment and fusion; and an unreliable multimodal long-term dependency mechanism caused by a basic attention model processing multiple modalities in parallel. To address these problems, this paper proposes LAACMT, a multimodal sentiment analysis model based on a lightweight attentive aggregation module and a cross-modal Transformer, which performs binary and multi-class classification tasks on multimodal non-aligned data. The model extracts low-resource modality information with a gated recurrent unit (GRU) and an improved feature extraction algorithm, aligns multimodal contexts with positional encoding combined with convolutional scaling, and fuses the aligned modalities with a cross-modal multi-head attention mechanism that establishes a reliable cross-modal long-term dependency mechanism. Experimental results on CMU-MOSI, a non-aligned dataset containing three modalities (text, audio, and video), show that the model's evaluation metrics consistently improve over the state of the art (SOTA): Acc7 improves by 3.96%, Acc2 by 4.08%, and F1 score by 3.35%. Ablation results confirm that the proposed model addresses the problems above, reduces the complexity of Transformer-based multimodal sentiment analysis models, and improves performance while avoiding over-fitting.
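The alignment step described in the abstract, positional encoding combined with convolutional scaling, can be sketched in NumPy as a "same"-padded 1-D temporal convolution that projects each modality to a shared feature dimension, plus standard sinusoidal positional encodings. All names, kernel sizes, and the random projection weights below are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """Standard sinusoidal positional encoding: sin on even dims, cos on odd."""
    pos = np.arange(length)[:, None]                     # (length, 1)
    i = np.arange(d_model)[None, :]                      # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def conv_align(x, d_model, kernel_size=3, seed=0):
    """Project a (time, feat) modality sequence to d_model channels with a
    'same'-padded 1-D convolution, then add positional encodings so that
    sequences from different modalities share one feature scale."""
    t, f = x.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((kernel_size, f, d_model)) / np.sqrt(kernel_size * f)
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))                 # zero-pad in time only
    out = np.zeros((t, d_model))
    for step in range(t):
        window = xp[step:step + kernel_size]             # (kernel, feat)
        out[step] = np.einsum('kf,kfd->d', window, w)
    return out + sinusoidal_pe(t, d_model)               # (time, d_model)
```

Applied to, say, 5-dimensional audio features and 20-dimensional video features, both sequences come out with the same channel width, which is what lets the later cross-modal attention operate on them jointly.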

Key words: multimodal sentiment analysis, lightweight attentive aggregation module, cross-modal Transformer, gated recurrent unit, cross-modal multi-head attention mechanism
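The cross-modal multi-head attention named in the keywords, where queries come from one modality and keys/values from another so that the target modality attends to the source, can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions with random projection matrices; it is not the LAACMT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_mod, kv_mod, num_heads=4, seed=0):
    """Queries from one modality (e.g. text), keys/values from another
    (e.g. audio); sequence lengths may differ, which is exactly the
    non-aligned-data case."""
    t_q, d = q_mod.shape
    t_kv, _ = kv_mod.shape
    assert d % num_heads == 0
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    # Split the model dimension into heads: (heads, time, head_dim).
    Q = (q_mod @ Wq).reshape(t_q, num_heads, dh).transpose(1, 0, 2)
    K = (kv_mod @ Wk).reshape(t_kv, num_heads, dh).transpose(1, 0, 2)
    V = (kv_mod @ Wv).reshape(t_kv, num_heads, dh).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)      # (heads, t_q, t_kv)
    out = softmax(scores) @ V                            # (heads, t_q, dh)
    return out.transpose(1, 0, 2).reshape(t_q, d) @ Wo   # (t_q, d)
```

Because the output length follows the query modality (here t_q), each modality can be enriched with information from the others without first forcing word-level alignment between them.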