Bimodal Emotion Recognition Model Based on Cascaded Two Channel Phased Fusion

doi:10.3778/j.issn.1002-8331.2111-0542

Abstract

Abstract: To fully extract deep emotional features from text and speech, enable effective interactive fusion between the two modalities, and improve emotion recognition accuracy, a bimodal emotion recognition model based on cascade two channel and phased fusion (CTC-PF) is proposed. A cascaded sequential attention-Encoder (CSA-Encoder) is designed to process long-range speech emotion sequences in parallel and extract deep speech emotion features. An affective field cascade-Encoder (AFC-Encoder) is proposed to improve the model's global and local text understanding and to address the sparsity of key emotional features in text. After the two cascaded channels have extracted features from the speech and text inputs, a co-attention mechanism interactively fuses the important emotional features of both modalities, which reduces the cost of alignment operations. A Hadamard (element-wise) product then performs a second fusion to capture difference features. This phased fusion enables information interaction between modal sequences with different numbers of time steps and addresses the insufficient interaction of bimodal emotional information. In classification experiments on the IEMOCAP dataset, the model reaches an emotion recognition accuracy of 79.4% and an F1 score of 79.0%, a clear improvement over existing mainstream methods, which demonstrates its advantage for emotion recognition that fuses speech and text.

Key words: bimodal emotion recognition, cascaded encoder, phased fusion, information interaction
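The two fusion stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no equations, so the scaled dot-product form of the co-attention, the mean pooling over time, and all function names (`attend`, `mean_pool`, `phased_fusion`) are assumptions made for illustration. The sketch only shows how two sequences of different lengths can interact without explicit alignment and then be combined with a Hadamard product.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys):
    """For each query vector, return a softmax-weighted sum of the key vectors.

    Assumption: scaled dot-product scores; the paper may use another form.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(w * k[i] for w, k in zip(weights, keys))
                    for i in range(d)])
    return out

def mean_pool(seq):
    """Average a sequence of vectors over the time axis (assumed pooling)."""
    n = len(seq)
    return [sum(col) / n for col in zip(*seq)]

def phased_fusion(speech, text):
    # Stage 1 (assumed co-attention): each modality attends to the other,
    # so sequences of different lengths interact without explicit alignment.
    speech_ctx = attend(speech, text)   # len(speech) vectors of size d
    text_ctx = attend(text, speech)     # len(text) vectors of size d
    s_vec = mean_pool(speech_ctx)
    t_vec = mean_pool(text_ctx)
    # Stage 2: Hadamard (element-wise) product of the two pooled vectors.
    return [a * b for a, b in zip(s_vec, t_vec)]

# Hypothetical toy features: 3 speech frames and 2 text tokens, dimension 4.
speech = [[0.1, 0.3, -0.2, 0.5], [0.0, 0.4, 0.1, -0.1], [0.2, -0.3, 0.6, 0.0]]
text = [[0.5, 0.1, 0.0, -0.2], [-0.1, 0.2, 0.3, 0.4]]
fused = phased_fusion(speech, text)
print(fused)
```

In this sketch the fused vector has the feature dimension of the inputs regardless of how many time steps each modality contributes; in a full model it would typically feed a classifier over the emotion categories.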

XU Zhijing, LIU Xia. Bimodal Emotion Recognition Model Based on Cascaded Two Channel Phased Fusion[J]. Computer Engineering and Applications, 2023, 59(8): 127-137.
