Bi-Bi-Modality with Bi-Gated Fusion in Multimodal Sentiment Analysis

doi:10.3778/j.issn.1002-8331.2302-0088

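Abstract

Abstract (translated from the Chinese): To balance the uneven distribution of sentiment information across modalities and obtain deeper multimodal sentiment representations, this paper proposes a multimodal sentiment analysis method based on bi-bi-modality with bi-gated fusion. The text and visual modalities and the text and audio modalities are fused separately, which takes full account of the dominant role of the text modality among the three modalities. To capture deeper multimodal interaction information, fusion is performed twice. In the first fusion, a fusion gate decides how much knowledge from the supplementary modality is added to the main modality, producing two bimodal hybrid knowledge matrices. In the second fusion, because the two bimodal hybrid knowledge matrices contain redundant and repeated information, a selection gate selects effective, compact sentiment information from them as the knowledge after bimodal fusion. On the public dataset CMU-MOSEI, the accuracy and F1 score for binary sentiment classification reach 86.2% and 86.1%, respectively, showing good robustness and advanced performance.

Keywords: multimodal sentiment analysis, bi-bi-modality, two-stage fusion, gated attention mechanism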

Abstract: To balance the uneven distribution of sentiment information across modalities and obtain a deeper multimodal sentiment representation, this paper proposes a method called bi-bi-modality with bi-gated fusion (BBBGF) for multimodal sentiment analysis. The text-vision and text-audio modality pairs are fused separately, so that the dominant position of the text modality among the three modalities is fully considered. In addition, two successive fusions are used to obtain multimodal sentiment interaction information at a deeper level. In the first fusion, a fusion gate decides how much knowledge from the supplementary modality is added to the main modality, yielding two bi-modality hybrid knowledge matrices. In the second fusion, because the two bi-modality hybrid knowledge matrices contain redundant and repeated information, a selection gate selects effective, non-repeating sentiment information as the final knowledge. On the public dataset CMU-MOSEI, the accuracy and F1 score for binary sentiment classification reach 86.2% and 86.1%, respectively, showing good robustness and advanced performance.

Keywords: multimodal sentiment analysis, bi-bi-modality, bi-gated fusion, gated attention
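
The two-stage gating described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the gate equations, so the sigmoid gates, the additive update in the first fusion, the convex combination in the second fusion, the randomly initialised parameters, the feature dimension, and every function name here are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    """Return weights @ vec + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def make_layer(rng, out_dim, in_dim):
    """Randomly initialised gate parameters (assumption: untrained, for illustration)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def fusion_gate(main, supp, layer):
    """First fusion (assumed form): add a gated share of the
    supplementary modality to the main modality."""
    gate = [sigmoid(z) for z in linear(*layer, main + supp)]
    return [m + g * s for m, g, s in zip(main, gate, supp)]


def selection_gate(mix_a, mix_b, layer):
    """Second fusion (assumed form): per-dimension convex combination
    of the two bimodal mixtures, so overlapping information is not
    simply summed twice."""
    gate = [sigmoid(z) for z in linear(*layer, mix_a + mix_b)]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gate, mix_a, mix_b)]


rng = random.Random(0)
dim = 4  # hypothetical feature size

# Stand-ins for encoded utterance features of the three modalities.
text = [rng.gauss(0, 1) for _ in range(dim)]
vision = [rng.gauss(0, 1) for _ in range(dim)]
audio = [rng.gauss(0, 1) for _ in range(dim)]

tv_layer = make_layer(rng, dim, 2 * dim)
ta_layer = make_layer(rng, dim, 2 * dim)
sel_layer = make_layer(rng, dim, 2 * dim)

# Text is the main modality in both bimodal pairs.
text_vision = fusion_gate(text, vision, tv_layer)
text_audio = fusion_gate(text, audio, ta_layer)

# The selection gate merges the two bimodal mixtures into one vector.
fused = selection_gate(text_vision, text_audio, sel_layer)
print(fused)
```

In this sketch text acts as the main modality in both pairs, which mirrors the dominant role the abstract assigns to it. In a trained model the gate parameters would be learned jointly with the classifier rather than sampled at random.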

LIU Qingwen, Mairidan·Wushouer, Gulanbaier·Tuerhong. Bi-Bi-Modality with Bi-Gated Fusion in Multimodal Sentiment Analysis[J]. Computer Engineering and Applications, 2024, 60(8): 165-172.
