[1] 20 amazing video consumption trends to watch out for (2024)[EB/OL]. (2024-08-29)[2025-07-17]. https://vidico.com/news/video-consumptiontrends.com.
[2] SMITH A K, BOLTON R N. The effect of customers’ emotional responses to service failures on their recovery effort evaluations and satisfaction judgments[J]. Journal of the Academy of Marketing Science, 2002, 30(1): 5-23.
[3] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 1103-1114.
[4] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[5] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[6] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 10790-10797.
[7] RAHMAN W, HASAN M K, LEE S W, et al. Integrating multimodal information in large pretrained Transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 2359-2369.
[8] ZADEH A, ZELLERS R, PINCUS E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88.
[9] ZADEH A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[10] YU W M, XU H, MENG F Y, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3718-3727.
[11] GUPTA A, JAISWAL R, ADHIKARI S, et al. DAiSEE: dataset for affective states in e-learning environments[J]. arXiv:1609.01885, 2016.
[12] YANG F. SCB-Dataset: a dataset for detecting student and teacher classroom behavior[J]. arXiv:2304.02488, 2023.
[13] ZHANG H Y, WANG Y, YIN G H, et al. Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2023: 756-767.
[14] GUO Z R, JIN T, ZHAO Z. Multimodal prompt learning with missing modalities for sentiment analysis and emotion recognition[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2024: 1726-1736.
[15] PANG B, LEE L, VAITHYANATHAN S. Thumbs up? Sentiment classification using machine learning techniques[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2002: 79-86.
[16] MAAS A L, DALY R E, PHAM P T, et al. Learning word vectors for sentiment analysis[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2011: 142-150.
[17] LEE S. Sentiment analysis system using Stanford sentiment treebank[J]. Journal of the Korean Society of Marine Engineering, 2015, 39(3): 274-279.
[18] LIVINGSTONE S R, RUSSO F A. The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English[J]. PLoS One, 2018, 13(5): e0196391.
[19] BURKHARDT F, PAESCHKE A, ROLFES M, et al. A database of German emotional speech[C]//Proceedings of the 9th European Conference on Speech Communication and Technology, 2005.
[20] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[21] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2247-2256.
[22] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, Louisiana, USA: AAAI Press, 2018: 5634-5640.
[23] HU G M, LIN T E, ZHAO Y, et al. UniMSE: towards unified multimodal sentiment analysis and emotion recognition[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2022: 7837-7851.
[24] JORDAN P J, THOMPSON B M. Measuring user satisfaction: a framework for understanding and improving user experience[J]. International Journal of Human-Computer Studies, 2013, 71(3): 230-245.
[25] FORNELL C. A national customer satisfaction barometer: the Swedish experience[J]. Journal of Marketing, 1992, 56(1): 6-21.
[26] HU M Q, LIU B. Mining and summarizing customer reviews[C]//Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2004: 168-177.
[27] MAIRESSE F, WALKER M A, MEHL M R, et al. Using linguistic cues for the automatic recognition of personality in conversation and text[J]. Journal of Artificial Intelligence Research, 2007, 30: 457-500.
[28] SOCHER R, PERELYGIN A, WU J, et al. Recursive deep models for semantic compositionality over a sentiment treebank[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2013: 1631-1642.
[29] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, Louisiana, USA: AAAI Press, 2018.
[30] GONG Y, LIU J, ZHANG F, et al. Multimodal cyclic translation network for visual and speech emotion recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, CA, USA: IEEE, 2019: 10286-10295.
[31] DU W C, CHEN S B, ZHANG X M, et al. Hierarchical feature fusion network for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, WA, USA: ACM, 2020: 2232-2240.
[32] ZHANG H, WANG W, YU T. Towards robust multimodal sentiment analysis with incomplete data[C]//Advances in Neural Information Processing Systems, 2024: 55943-55974.
[33] SUN L C, LIAN Z, LIU B, et al. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2024, 15(1): 309-325.