Dynamic Dominant Fusion Multimodal Sentiment Analysis Method Based on Autoencoder

doi:10.3778/j.issn.1002-8331.2211-0010

Abstract

In multimodal sentiment analysis, the modality that dominates the sentiment judgment changes dynamically over time. Traditional multimodal sentiment analysis methods usually treat text as the only dominant modality and ignore the fact that, because the modalities differ, the dominant modality changes from moment to moment. To select the dominant modality dynamically at each moment, this paper proposes a dynamic dominant fusion multimodal sentiment analysis method based on an autoencoder. The method first encodes each single modality and obtains multimodal fusion features, and then uses an autoencoder to map these features into a shared space. In this space, the correlation between each unimodal feature and the fused feature is measured, and at each moment the modality with the highest correlation is selected as the dominant modality. Finally, the dominant modality is used to guide multimodal information fusion, yielding a robust multimodal representation. Extensive experiments on the multimodal sentiment analysis benchmark dataset CMU-MOSI demonstrate the effectiveness of the proposed method, which outperforms most existing state-of-the-art multimodal sentiment analysis methods.

Key words: multimodal sentiment analysis, dynamic complementarity, dominant modality, autoencoder

摘要： 多模态情感分析过程中，对情感判定起主导作用的模态常常是动态变化的。传统多模态情感分析方法中通常仅以文本为主导模态，而忽略了由于模态之间的差异性造成不同时刻主导模态的变化。针对如何在各个时刻动态选取主导模态的问题，提出一种自编码器动态主导融合的多模态情感分析方法。该方法首先对单模态编码并获得多模态融合特征，再利用自编码器将其表征到共享空间内；在此空间内衡量单模态特征与融合模态特征的相关程度，在各个时刻动态地选取相关程度最大的模态作为该时刻的主导模态；最后，利用主导模态引导多模态信息融合，得到多模态鲁棒性表征。在多模态情感分析基准数据集CMU-MOSI上进行广泛实验，实验结果表明提出方法的有效性，并且优于大多数现有最先进的多模态情感分析方法。

关键词: 多模态情感分析, 动态互补, 主导模态, 自编码器
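
The dominant-modality selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the unimodal and fused features have already been mapped into the shared space, it uses cosine similarity as a stand-in for the correlation measure (the abstract does not say which measure is used), and the function names `cosine` and `select_dominant` and the toy feature values are hypothetical.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (assumed correlation measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def select_dominant(unimodal, fused):
    """Return, for each time step, the modality most similar to the fused feature.

    unimodal: dict mapping a modality name to a list of per-time-step vectors,
              assumed to be already projected into the shared space.
    fused:    list of per-time-step fused feature vectors in the same space.
    """
    names = list(unimodal)
    dominant = []
    for t, fused_t in enumerate(fused):
        scores = {name: cosine(unimodal[name][t], fused_t) for name in names}
        dominant.append(max(scores, key=scores.get))
    return dominant


# Toy shared-space features for three time steps (hypothetical values).
fused = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unimodal = {
    "text": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "audio": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    "vision": [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}

print(select_dominant(unimodal, fused))  # ['text', 'audio', 'vision']
```

In this toy example the dominant modality differs at every time step, which is the situation a fixed text-dominant scheme cannot represent. How the selected modality then guides the fusion is not specified in the abstract and is left out of the sketch.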

YANG Xi, GUO Junjun, YAN Haining, TAN Kaiwen, XIANG Yan, YU Zhengtao. Dynamic Dominant Fusion Multimodal Sentiment Analysis Method Based on Autoencoder[J]. Computer Engineering and Applications, 2024, 60(6): 180-187.

杨溪, 郭军军, 严海宁, 谭凯文, 相艳, 余正涛. 自编码器动态主导融合的多模态情感分析[J]. 计算机工程与应用, 2024, 60(6): 180-187.
