[1] YUAN Z Q, LI W, XU H, et al. Transformer-based feature reconstruction network for robust multimodal sentiment analysis[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4400-4407.
[2] SUN H, CHEN Y W, LIN L F. TensorFormer: a tensor-based multimodal transformer for multimodal sentiment analysis and depression detection[J]. IEEE Transactions on Affective Computing, 2023, 14(4): 2776-2786.
[3] HOU M, TANG J J, ZHANG J H, et al. Deep multimodal multilinear fusion with high-order polynomial pooling[C]//Advances in Neural Information Processing Systems, 2019: 12156-12166.
[4] WU Y, LIN Z J, ZHAO Y Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis[C]//Findings of the Association for Computational Linguistics. Stroudsburg: ACL, 2021: 4730-4738.
[5] HU G M, LIN T E, ZHAO Y, et al. UniMSE: towards unified multimodal sentiment analysis and emotion recognition[J]. arXiv:2211.11256, 2022.
[6] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[J]. arXiv:1707.07250, 2017.
[7] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[J]. arXiv:1806.00064, 2018.
[8] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[9] TSAI Y H, LIANG P P, ZADEH A, et al. Learning factorized multimodal representations[J]. arXiv:1806.06176, 2018.
[10] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[11] RAHMAN W, HASAN M K, LEE S W, et al. Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 2359-2369.
[12] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 10790-10797.
[13] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[14] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[15] DEGOTTEX G, KANE J, DRUGMAN T, et al. COVAREP: a collaborative voice analysis repository for speech technologies[C]//Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2014: 960-964.
[16] CHEONG J H, JOLLY E, XIE T K, et al. Py-Feat: Python facial expression analysis toolbox[J]. Affective Science, 2023, 4(4): 781-796.
[17] LIN H, ZHANG P L, LING J D, et al. PS-Mixer: a polar-vector and strength-vector mixer model for multimodal sentiment analysis[J]. Information Processing & Management, 2023, 60(2): 103229.
[18] LEE J, KIM S, KIM S, et al. Context-aware emotion recognition networks[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2020: 10142-10151.
[19] GENG X Y, LIU H, LEE L S, et al. Multimodal masked autoencoders learn transferable representations[J]. arXiv:2205.14204, 2022.
[20] MCLACHLAN G J. Mahalanobis distance[J]. Resonance, 1999, 4(6): 20-26.
[21] ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[J]. arXiv:1606.06259, 2016.
[22] BAGHER ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[23] HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[J]. arXiv:2109.00412, 2021.
[24] WANG D, LIU S, WANG Q, et al. Cross-modal enhancement network for multimodal sentiment analysis[J]. IEEE Transactions on Multimedia, 2023, 25: 4909-4921.
[25] WANG D, GUO X T, TIAN Y M, et al. TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259.
[26] LIU W L, XU H, HUA Y, et al. AdaFN-AG: enhancing multimodal interaction with adaptive feature normalization for multimodal sentiment analysis[J]. Intelligent Systems with Applications, 2024, 23: 200410.