[1] BALTRUSAITIS T, AHUJA C, MORENCY L P. Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443.
[2] ZHANG Q A, SHI L, LIU P Y, et al. ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis[J]. Applied Intelligence, 2023, 53(12): 16332-16345.
[3] WANG D, GUO X T, TIAN Y M, et al. TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis[J]. Pattern Recognition, 2023, 136: 109259.
[4] SHAYAA S, JAAFAR N I, BAHRI S, et al. Sentiment analysis of big data: methods, applications, and open challenges[J]. IEEE Access, 2018, 6: 37807-37827.
[5] YING C C, WU Z, DAI X Y, et al. Opinion transmission network for jointly improving aspect-oriented opinion words extraction and sentiment classification[C]//Proceedings of the 9th CCF International Conference on Natural Language Processing and Chinese Computing. Cham: Springer, 2020: 629-640.
[6] LI R F, CHEN H, FENG F X, et al. Dual graph convolutional networks for aspect-based sentiment analysis[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2021: 6319-6329.
[7] D’MELLO S K, KORY J. A review and meta-analysis of multimodal affect detection systems[J]. ACM Computing Surveys, 2015, 47(3): 1-36.
[8] DAS R, SINGH T D. Multimodal sentiment analysis: a survey of methods, trends, and challenges[J]. ACM Computing Surveys, 2023, 55(13S): 1-38.
[9] ZHANG Y Z, SONG D W, LI X, et al. A quantum-like multimodal network framework for modeling interaction dynamics in multiparty conversational sentiment analysis[J]. Information Fusion, 2020, 62: 14-31.
[10] YOU Q Z, LUO J B, JIN H L, et al. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia[C]//Proceedings of the 9th ACM International Conference on Web Search and Data Mining. New York: ACM, 2016: 13-22.
[11] FEDUS W, ZOPH B, SHAZEER N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity[J]. Journal of Machine Learning Research, 2022, 23(1): 5232-5270.
[12] ZHANG Q, FU J L, LIU X Y, et al. Adaptive co-attention network for named entity recognition in tweets[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 5674-5681.
[13] LI Y, DING H, LIN Y M, et al. Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis[J]. Artificial Intelligence Review, 2024, 57(4): 78.
[14] YANG L, NA J C, YU J F. Cross-modal multitask transformer for end-to-end multimodal aspect-based sentiment analysis[J]. Information Processing & Management, 2022, 59(5): 103038.
[15] LING Y, YU J F, XIA R. Vision-language pre-training for multimodal aspect-based sentiment analysis[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2022: 2149-2159.
[16] JACOBS R A, JORDAN M I, NOWLAN S J, et al. Adaptive mixtures of local experts[J]. Neural Computation, 1991, 3(1): 79-87.
[17] JORDAN M I, JACOBS R A. Hierarchical mixtures of experts and the EM algorithm[J]. Neural Computation, 1994, 6(2): 181-214.
[18] SHAZEER N, MIRHOSEINI A, MAZIARZ K, et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer[C]//Proceedings of the 5th International Conference on Learning Representations, 2017.
[19] SHAZEER N, CHENG Y L, PARMAR N, et al. Mesh-TensorFlow: deep learning for supercomputers[C]//Advances in Neural Information Processing Systems 31, 2018: 10435-10444.
[20] LEPIKHIN D, LEE H J, XU Y, et al. GShard: scaling giant models with conditional computation and automatic sharding[C]//Proceedings of the 9th International Conference on Learning Representations, 2021.
[21] CELIK O, ZHOU D, LI G, et al. Specializing versatile skill libraries using local mixture of experts[C]//Proceedings of the 2021 Conference on Robot Learning, 2021: 1423-1433.
[22] CHEN Z T, SHEN Y K, DING M Y, et al. Mod-Squad: designing mixtures of experts as modular multi-task learners[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 11828-11837.
[23] LEWIS M, BHOSALE S, DETTMERS T, et al. Base layers: simplifying training of large, sparse models[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 6265-6274.
[24] MUSTAFA B, RIQUELME C, PUIGCERVER J, et al. Multimodal contrastive learning with LIMoE: the language-image mixture of experts[C]//Advances in Neural Information Processing Systems 35, 2022: 9564-9576.
[25] RIQUELME C, PUIGCERVER J, MUSTAFA B, et al. Scaling vision with sparse mixture of experts[C]//Advances in Neural Information Processing Systems 34, 2021: 8583-8595.
[26] FAN A, BHOSALE S, SCHWENK H, et al. Beyond English-centric multilingual machine translation[J]. Journal of Machine Learning Research, 2021, 22(1): 4839-4886.
[27] CAO B, SUN Y M, ZHU P F, et al. Multi-modal gated mixture of local-to-global experts for dynamic image fusion[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 23498-23507.
[28] YU J F, JIANG J, YANG L, et al. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3342-3352.
[29] JU X C, ZHANG D, XIAO R, et al. Joint multi-modal aspect-sentiment analysis with auxiliary cross-modal relation detection[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2021: 4395-4405.
[30] YU J F, JIANG J. Adapting BERT for target-oriented multimodal sentiment classification[C]//Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019: 5408-5414.
[31] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 8748-8763.
[32] HAZARIKA D, PORIA S, ZADEH A, et al. Conversational memory network for emotion recognition in dyadic dialogue videos[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2018: 2122-2132.
[33] CHEN P, SUN Z Q, BING L D, et al. Recurrent attention network on memory for aspect sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 452-461.
[34] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 1103-1114.
[35] XU N, MAO W J, CHEN G D. Multi-interactive memory network for aspect based multimodal sentiment analysis[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 371-378.
[36] YU J F, JIANG J, XIA R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 28: 429-439.