[1] MINSKY M. The emotion machine: commonsense thinking, artificial intelligence, and the future of the human mind[M]. New York: Simon & Schuster, 2007.
[2] SCHULLER B W. Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends[J]. Communications of the ACM, 2018, 61(5): 90-99.
[3] LI H, ZHANG X, WANG M J. Research on speech emotion recognition based on deep neural network[C]//Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing, 2021: 795-799.
[4] LI Z, TANG F, SUN T, et al. SEOVER: sentence-level emotion orientation vector based conversation emotion recognition model[C]//Proceedings of the 28th International Conference on Neural Information Processing, Sanur, Dec 8-12, 2021: 468-475.
[5] PENG Z, LI X, ZHU Z, et al. Speech emotion recognition using 3D convolutions and attention-based sliding recurrent networks with auditory front-ends[J]. IEEE Access, 2020, 8: 16560-16572.
[6] YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text[C]//Proceedings of the 2018 IEEE Spoken Language Technology Workshop, 2018: 112-118.
[7] LI Z J, CHEN N. Speech emotion recognition model based on multimodal fusion of graph neural network[J]. Application Research of Computers, 2023, 40(8): 2286-2291.
[8] HU J, LIU Y, ZHAO J, et al. MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021: 5666-5675.
[9] MCFEE B, RAFFEL C, LIANG D, et al. librosa: audio and music signal analysis in python[C]//Proceedings of the 14th Python in Science Conference, 2015: 18-25.
[10] EYBEN F, WÖLLMER M, SCHULLER B. openSMILE: the Munich versatile and fast open-source audio feature extractor[C]//Proceedings of the 18th ACM International Conference on Multimedia, 2010: 1459-1462.
[11] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[12] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
[13] KUMAR A, VEPA J. Gated mechanism for attention based multimodal sentiment analysis[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 4477-4481.
[14] SUN L, LIU B, TAO J, et al. Multimodal cross-and self-attention network for speech emotion recognition[C]//Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, 2021: 4275-4279.
[15] WU C H, LIN J C, WEI W L. Survey on audiovisual emotion recognition: databases, features, and data fusion strategies[J]. APSIPA Transactions on Signal and Information Processing, 2014, 3: e12.
[16] BALTRUŠAITIS T, AHUJA C, MORENCY L P. Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443.
[17] GLODEK M, TSCHECHNE S, LAYHER G, et al. Multiple classifier systems for the classification of audio-visual emotional states[C]//Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction, Memphis, Oct 9-12, 2011. Berlin, Heidelberg: Springer, 2011: 359-368.
[18] REN H, WANG X G. A survey of attention mechanisms[J]. Journal of Computer Applications, 2021, 41(S1): 1-6.
[19] ROGERS A, KOVALEVA O, RUMSHISKY A. A primer in BERTology: what we know about how BERT works[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 842-866.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017.
[21] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014: 1532-1543.
[22] PEPINO L, RIERA P, FERRER L, et al. Fusion approaches for emotion recognition from speech using acoustic and text-based features[C]//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 6484-6488.
[23] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[24] GAO T, YAO X, CHEN D. SimCSE: simple contrastive learning of sentence embeddings[J]. arXiv:2104.08821, 2021.
[25] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 1597-1607.
[26] ZHANG C S, CHEN J, LI Q L, et al. A survey of deep contrastive learning[J]. Acta Automatica Sinica, 2023, 49(1): 15-39.
[27] KHOSLA P, TETERWAK P, WANG C, et al. Supervised contrastive learning[C]//Advances in Neural Information Processing Systems 33, 2020: 18661-18673.
[28] LI X, LIU X P, LI W C, et al. A survey of contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797.
[29] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42: 335-359.
[30] NEUMANN M, VU N T. Improving speech emotion recognition with unsupervised representation learning on unlabeled speech[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, 2019: 7390-7394.
[31] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv:1412.6980, 2014.
[32] LI H, DING W, WU Z, et al. Learning fine-grained cross modality excitement for speech emotion recognition[J]. arXiv:2010.12733, 2020.
[33] MAKIUCHI M R, UTO K, SHINODA K. Multimodal emotion recognition with high-level speech and text features[C]//Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop, 2021: 350-357.
[34] CHEN W, XING X, XU X, et al. Key-sparse transformer for multimodal speech emotion recognition[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 6897-6901.
[35] KHURANA Y, GUPTA S, SATHYARAJ R, et al. RobinNet: a multimodal speech emotion recognition system with speaker recognition for social interactions[J]. IEEE Transactions on Computational Social Systems, 2024, 11(1): 478-487.
[36] HOU M, ZHANG Z, LU G. Multi-modal emotion recognition with self-guided modality calibration[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 4688-4692.
[37] DUTTA S, GANAPATHY S. Multimodal transformer with learnable frontend and self attention for emotion recognition[C]//Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 6917-6921.