[1] MINSKY M. The emotion machine: commonsense thinking, artificial intelligence, and the future of the human mind[M]. New York: Simon & Schuster, 2007.
[2] SCHULLER B W. Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends[J]. Communications of the ACM, 2018, 61(5): 90-99.
[3] LI H, ZHANG X, WANG M J. Research on speech emotion recognition based on deep neural network[C]//Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing, 2021: 795-799.
[4] LI Z, TANG F, SUN T, et al. SEOVER: sentence-level emotion orientation vector based conversation emotion recognition model[C]//Proceedings of the 28th International Conference on Neural Information Processing, Sanur, Dec 8-12, 2021: 468-475.
[5] PENG Z, LI X, ZHU Z, et al. Speech emotion recognition using 3D convolutions and attention-based sliding recurrent networks with auditory front-ends[J]. IEEE Access, 2020, 8: 16560-16572.
[6] YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text[C]//Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018: 112-118.
[7] LI Z J, CHEN N. Speech emotion recognition model based on multimodal fusion of graph neural network[J]. Application Research of Computers, 2023, 40(8): 2286-2291.
[8] HU J, LIU Y, ZHAO J, et al. MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021: 5666-5675.
[9] MCFEE B, RAFFEL C, LIANG D, et al. librosa: audio and music signal analysis in python[C]//Proceedings of the 14th Python in Science Conference, 2015: 18-25.
[10] EYBEN F, WÖLLMER M, SCHULLER B. openSMILE: the Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia, 2010: 1459-1462.
[11] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[12] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[13] KUMAR A, VEPA J. Gated mechanism for attention based multimodal sentiment analysis[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 4477-4481.
[14] SUN L, LIU B, TAO J, et al. Multimodal cross-and self-attention network for speech emotion recognition[C]//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, 2021: 4275-4279.
[15] WU C H, LIN J C, WEI W L. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies[J]. APSIPA Transactions on Signal and Information Processing, 2014, 3: e12.
[16] BALTRUŠAITIS T, AHUJA C, MORENCY L P. Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443.
[17] GLODEK M, TSCHECHNE S, LAYHER G, et al. Multiple classifier systems for the classification of audio-visual emotional states[C]//Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction, Memphis, Oct 9-12, 2011. Berlin, Heidelberg: Springer, 2011: 359-368.
[18] REN H, WANG X G. A survey of attention mechanisms[J]. Journal of Computer Applications, 2021, 41(S1): 1-6.
[19] ROGERS A, KOVALEVA O, RUMSHISKY A. A primer in BERTology: what we know about how BERT works[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 842-866.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017.
[21] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014: 1532-1543.
[22] PEPINO L, RIERA P, FERRER L, et al. Fusion approaches for emotion recognition from speech using acoustic and text-based features[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 6484-6488.
[23] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[24] GAO T, YAO X, CHEN D. SimCSE: simple contrastive learning of sentence embeddings[J]. arXiv:2104.08821, 2021.
[25] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 1597-1607.
[26] ZHANG C S, CHEN J, LI Q L, et al. A survey of deep contrastive learning[J]. Acta Automatica Sinica, 2023, 49(1): 15-39.
[27] KHOSLA P, TETERWAK P, WANG C, et al. Supervised contrastive learning[C]//Advances in Neural Information Processing Systems 33, 2020: 18661-18673.
[28] LI X, LIU X P, LI W C, et al. A survey of contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797.
[29] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42: 335-359.
[30] NEUMANN M, VU N T. Improving speech emotion recognition with unsupervised representation learning on unlabeled speech[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, 2019: 7390-7394.
[31] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv:1412.6980, 2014.
[32] LI H, DING W, WU Z, et al. Learning fine-grained cross modality excitement for speech emotion recognition[J]. arXiv:2010.12733, 2020.
[33] MAKIUCHI M R, UTO K, SHINODA K. Multimodal emotion recognition with high-level speech and text features[C]//Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop, 2021: 350-357.
[34] CHEN W, XING X, XU X, et al. Key-sparse transformer for multimodal speech emotion recognition[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 6897-6901.
[35] KHURANA Y, GUPTA S, SATHYARAJ R, et al. RobinNet: a multimodal speech emotion recognition system with speaker recognition for social interactions[J]. IEEE Transactions on Computational Social Systems, 2024, 11(1): 478-487.
[36] HOU M, ZHANG Z, LU G. Multi-modal emotion recognition with self-guided modality calibration[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 4688-4692.
[37] DUTTA S, GANAPATHY S. Multimodal transformer with learnable frontend and self attention for emotion recognition[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 6917-6921.