计算机工程与应用 ›› 2024, Vol. 60 ›› Issue (1): 165-173.DOI: 10.3778/j.issn.1002-8331.2207-0498

• Pattern Recognition and Artificial Intelligence •


Cross-Modal Emotion Analysis of Semantic and Spatio-Temporal Dynamic Interaction

QU Licheng, QIE Liyuan, LIU Zijun, WEI Si, DONG Zhewei   

  1. School of Information Engineering, Chang’an University, Xi’an 710064, China
  • Online: 2024-01-01  Published: 2024-01-01


Abstract: To address the poor inter-modal interaction and weak fusion of spatial and temporal features in traditional sentiment analysis, a cross-modal semantic and spatio-temporal dynamic interaction network is proposed. A bi-directional long short-term memory network mines the time-series features of each modality, a self-attention mechanism strengthens the weighting of intra-modal features, and the automatically screened feature matrices are fed into a graph convolutional network for semantic interaction. Features are then aggregated by timestamp, and the correlation coefficients of the aggregation layer are computed to obtain fused joint features, realizing cross-modal spatial interaction. Finally, emotional polarity is classified and predicted. The proposed model is evaluated on public datasets. Experimental results show that multimodal time-series extraction and the cross-modal semantic-spatial interaction mechanism achieve fully dynamic fusion of intra-modal and inter-modal features, effectively improving the accuracy and F1 score of sentiment classification; on the CMU-MOSEI dataset these metrics increase by 1.7% to 13.5% and 2.1% to 14.0% respectively, demonstrating good robustness and state-of-the-art performance.
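The pipeline the abstract describes can be illustrated with a minimal numpy sketch: self-attention re-weights each modality's time-series features, one graph-convolution step performs semantic interaction over all modality nodes, and a correlation-weighted aggregation fuses the three modalities at each timestamp. The shapes, the fully connected graph, the softmax-over-correlation weighting, and the use of random features in place of the BiLSTM outputs are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X):
    # scaled dot-product self-attention within one modality (Q = K = V = X)
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

def gcn_layer(A, X, W):
    # one graph-convolution step: symmetric-normalized adjacency propagation
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    D = np.diag(d_inv_sqrt)
    return np.tanh(D @ A_hat @ D @ X @ W)

T, d = 8, 16                                          # timestamps, feature dim (assumed)
text  = rng.standard_normal((T, d))                   # placeholders for BiLSTM outputs
audio = rng.standard_normal((T, d))
video = rng.standard_normal((T, d))

# intra-modal: self-attention strengthens feature weighting per modality
attended = [self_attention(m) for m in (text, audio, video)]

# semantic interaction: fully connected graph over all 3*T modality nodes
X = np.vstack(attended)                               # (3T, d)
A = np.ones((3 * T, 3 * T)) - np.eye(3 * T)
W = rng.standard_normal((d, d)) * 0.1
H = gcn_layer(A, X, W).reshape(3, T, d)               # (modality, time, feature)

# spatial interaction: fuse modalities per timestamp, weighting each by its
# correlation with the timestamp mean (softmax over the three coefficients)
fused = np.empty((T, d))
for t in range(T):
    mean_t = H[:, t].mean(axis=0)
    corr = np.array([np.corrcoef(H[m, t], mean_t)[0, 1] for m in range(3)])
    w = np.exp(corr) / np.exp(corr).sum()
    fused[t] = (w[:, None] * H[:, t]).sum(axis=0)

print(fused.shape)                                    # one joint feature vector per timestamp
```

The fused matrix would then feed a classifier head for polarity prediction; the real model additionally learns the attention, graph, and aggregation parameters end to end.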

Key words: cross-modal sentiment analysis, semantic interaction, spatio-temporal interaction, bi-directional long short-term memory, graph convolutional network