Bimodal Emotion Recognition Model for Speech-Text Based on Bi-LSTM-CNN

doi:10.3778/j.issn.1002-8331.2105-0042

Abstract

To address the low accuracy of single-modality emotion recognition, a speech-text bimodal emotion recognition model based on Bi-LSTM-CNN is proposed. A bi-directional long short-term memory network (Bi-LSTM) with word embeddings and a convolutional neural network (CNN) are combined into a Bi-LSTM-CNN model that extracts text features. These text features are fused with acoustic features, and the fused result is used as the input of a joint CNN model for speech emotion computation. Tests on the IEMOCAP multimodal emotion detection dataset show an emotion recognition accuracy of 69.51%, at least 6 percentage points higher than the single-modality models.
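The pipeline described in the abstract can be followed by tracing tensor shapes from stage to stage. The sketch below is illustrative only and is not the authors' implementation. Every layer size in `Dims`, the use of global max pooling, and fusion by concatenation are assumptions made for the example, since the abstract does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dims:
    """Hypothetical layer sizes; none of these values come from the paper."""
    seq_len: int = 50        # tokens per utterance after padding
    embed_dim: int = 300     # word-embedding size
    lstm_hidden: int = 128   # hidden units per LSTM direction
    text_filters: int = 64   # filters in the text CNN
    text_kernel: int = 3     # kernel width of the text CNN
    acoustic_dim: int = 384  # size of the utterance-level acoustic vector
    joint_filters: int = 32  # filters in the joint CNN
    joint_kernel: int = 5    # kernel width of the joint CNN
    n_classes: int = 4       # number of emotion categories


def conv1d_len(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded ("valid") 1-D convolution."""
    if kernel > length:
        raise ValueError("kernel is longer than the input")
    return (length - kernel) // stride + 1


def trace_shapes(d: Dims) -> dict:
    """Return the tensor shape after each stage for a single utterance."""
    shapes = {}
    shapes["embedding"] = (d.seq_len, d.embed_dim)
    # Forward and backward LSTM states are concatenated at each time step.
    shapes["bi_lstm"] = (d.seq_len, 2 * d.lstm_hidden)
    shapes["text_cnn"] = (conv1d_len(d.seq_len, d.text_kernel), d.text_filters)
    # Assumed global max pooling over time: one value per filter.
    shapes["text_features"] = (d.text_filters,)
    # Assumed feature-level fusion: concatenate text and acoustic vectors.
    fused = d.text_filters + d.acoustic_dim
    shapes["fused"] = (fused,)
    # The joint CNN slides over the fused vector as a one-channel sequence.
    shapes["joint_cnn"] = (conv1d_len(fused, d.joint_kernel), d.joint_filters)
    # Assumed pooling plus a dense layer map to one score per emotion class.
    shapes["logits"] = (d.n_classes,)
    return shapes


if __name__ == "__main__":
    for stage, shape in trace_shapes(Dims()).items():
        print(f"{stage:>14}: {shape}")
```

Changing a field of `Dims`, for example `Dims(acoustic_dim=88)`, shows how the width of the fused vector and the input length of the joint CNN depend on the acoustic feature set chosen.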

Keywords: speech emotion recognition, convolutional neural network (CNN), long short-term memory (LSTM), feature fusion

WANG Lanxin, WANG Weiya, CHENG Xin. Bimodal Emotion Recognition Model for Speech-Text Based on Bi-LSTM-CNN[J]. Computer Engineering and Applications, 2022, 58(4): 192-197.
