[1] GEBRU I D, BA S, LI X, et al. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(5): 1086-1099.
[2] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[J]. arXiv:1707.07250, 2017.
[3] MA D H, LI S J, ZHANG X D, et al. Interactive attention networks for aspect-level sentiment classification[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017: 4068-4074.
[4] LIN H, MA Z H, JI R R, et al. Boosting crowd counting via multifaceted attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 19596-19605.
[5] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018: 5642-5649.
[6] GU Y, YANG K, FU S, et al. Multimodal affective analysis using hierarchical attention strategy with word-level alignment[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 2225-2235.
[7] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[8] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[9] WILLIAMS J, KLEINEGESSE S, COMANESCU R, et al. Recognizing emotions in video using multimodal DNN feature fusion[C]//Proceedings of the Grand Challenge and Workshop on Human Multimodal Language. Stroudsburg: ACL, 2018: 11-19.
[10] LAN Z Z, BAO L, YU S I, et al. Multimedia classification and event detection using double fusion[J]. Multimedia Tools and Applications, 2014, 71(1): 333-347.
[11] KHALIGH-RAZAVI S M, KRIEGESKORTE N. Deep supervised, but not unsupervised, models may explain IT cortical representation[J]. PLoS Computational Biology, 2014, 10(11): e1003915.
[12] GÜÇLÜ U, VAN GERVEN M A J. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream[J]. Journal of Neuroscience, 2015, 35(27): 10005-10014.
[13] DWIVEDI K, BONNER M F, CICHY R M, et al. Unveiling functions of the visual cortex using task-specific deep neural networks[J]. PLoS Computational Biology, 2021, 17(8): e1009267.
[14] BERSCH D, DWIVEDI K, VILAS M, et al. Net2Brain: a toolbox to compare artificial vision models with human brain responses[J]. arXiv:2208.09677, 2022.
[15] SHAFER R L, SOLOMON E M, NEWELL K M, et al. Visual feedback during motor performance is associated with increased complexity and adaptability of motor and neural output[J]. Behavioural Brain Research, 2019, 376: 112214.
[16] DICARLO J J, ZOCCOLAN D, RUST N C. How does the brain solve visual object recognition?[J]. Neuron, 2012, 73(3): 415-434.
[17] TEUFEL C, NANAY B. How to (and how not to) think about top-down influences on visual perception[J]. Consciousness and Cognition, 2017, 47: 17-25.
[18] PARASKEVOPOULOS G, GEORGIOU E, POTAMIANOS A. MMLatch: bottom-up top-down fusion for multimodal sentiment analysis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 4573-4577.
[19] OTMAKHOVA Y, SHIN H. Do we really need lexical information? Towards a top-down approach to sentiment analysis of product reviews[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2015: 1559-1568.
[20] QIU Y, LIU Y, YANG H, et al. A simple saliency detection approach via automatic top-down feature fusion[J]. Neurocomputing, 2020, 388: 124-134.
[21] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[22] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[23] ZADEH A, ZELLERS R, PINCUS E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88.
[24] ZADEH B A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[25] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018: 5634-5641.
[26] SAHAY S, OKUR E, KUMAR S H, et al. Low rank fusion based transformers for multimodal sequences[J]. arXiv:2007.02038, 2020.
[27] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[28] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 10790-10797.
[29] SUN H, WANG H Y, LIU J Q, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3722-3729.