Multimodal Emotion Recognition Method Based on Dynamic Time Sequence Modeling

doi:10.3778/j.issn.1002-8331.2308-0231

Abstract

Abstract: Existing emotion recognition studies do not fully account for the local-global information and long-term temporal dependencies in speech signals, and text feature extraction suffers from feature sparsity and information loss. To address these problems, a multimodal emotion recognition method based on dynamic time sequence modeling is proposed. A dynamic time window module segments the speech signal to capture local-global information, and bidirectional sequence modeling captures the spatial information in the signal. Considering the importance of text for emotion analysis, a convolutional neural network based on the Transformer model captures the dependencies between different positions in the text and thereby models longer contextual information. Finally, the two modalities are fused to obtain the final emotion classification. Experimental results on the IEMOCAP dataset show that the model achieves better multimodal emotion recognition performance than other mainstream models.

Key words: multimodal sentiment analysis, dynamic time window, bidirectional time sequence modeling, convolutional neural networks, multimodal fusion
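The abstract does not specify how the dynamic time window module or the fusion step is implemented, so the following is only a minimal sketch of the general idea of summarizing a speech feature sequence at several window scales (one window for the global view, more windows for local views) and then joining the result with a text vector. The function names, the fixed `scales` tuple, mean pooling, and concatenation-based fusion are all assumptions made for illustration; they stand in for the paper's learned modules and are not the authors' method.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Average a non-empty run of equal-length frame vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def split_windows(frames: Sequence[Vector], num_windows: int) -> List[Sequence[Vector]]:
    """Split frames into num_windows contiguous, nearly equal segments."""
    n = len(frames)
    if not 1 <= num_windows <= n:
        raise ValueError("num_windows must be between 1 and len(frames)")
    # Integer boundaries guarantee every segment holds at least one frame.
    bounds = [i * n // num_windows for i in range(num_windows + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_windows)]


def multi_scale_summary(frames: Sequence[Vector],
                        scales: Sequence[int] = (1, 2, 4)) -> List[Vector]:
    """Pool the sequence at several window counts (hypothetical scales).

    A scale of 1 yields one global vector; larger scales yield local vectors.
    """
    summary: List[Vector] = []
    for num_windows in scales:
        for window in split_windows(frames, num_windows):
            summary.append(mean_pool(window))
    return summary


def fuse(speech_summary: Sequence[Vector], text_vector: Vector) -> Vector:
    """Fuse modalities by plain concatenation (an illustrative assumption)."""
    flat_speech = [value for vector in speech_summary for value in vector]
    return flat_speech + list(text_vector)


# Toy input: four frames with two features each, plus a one-dimensional text vector.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
summary = multi_scale_summary(frames, scales=(1, 2))
fused = fuse(summary, [0.5])
```

In this toy run, `summary` holds one global vector followed by two local vectors, and `fused` would be the input to a classifier. A real system would replace mean pooling with a learned sequence model and would choose window boundaries adaptively rather than from a fixed tuple.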

LI Jiaze, MEI Hongyan, JIA Liyun, LI Wenya. Multimodal Emotion Recognition Method Based on Dynamic Time Sequence Modeling[J]. Computer Engineering and Applications, 2025, 61(1): 196-205.
