Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (4): 192-197. DOI: 10.3778/j.issn.1002-8331.2105-0042

• Pattern Recognition and Artificial Intelligence •

Bimodal Emotion Recognition Model for Speech-Text Based on Bi-LSTM-CNN

WANG Lanxin, WANG Weiya, CHENG Xin   

  1. School of Information Engineering, Chang’an University, Xi’an 710064, China
  • Online: 2022-02-15 Published: 2022-02-15


Abstract: To address the low accuracy of single-modal emotion recognition, a speech-text bimodal emotion recognition model based on Bi-LSTM-CNN is proposed. The model combines a Bi-LSTM (bi-directional long short-term memory network) with word embeddings and a CNN (convolutional neural network) into a Bi-LSTM-CNN model for text feature extraction. The extracted text features are fused with acoustic features, and the fused result is used as the input of a joint CNN model for speech emotion computation. Test results on the IEMOCAP multimodal emotion detection dataset show that the emotion recognition accuracy reaches 69.51%, at least 6 percentage points higher than the single text modality model.

Key words: speech emotion recognition, convolutional neural network (CNN), long short-term memory (LSTM), feature fusion
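
The fusion and classification stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the fusion rule, layer sizes, or label set, so concatenation as the fusion step, the single convolution layer, the four emotion labels, the feature vectors, and the random weights below are all assumptions chosen for illustration. The text feature vector is a placeholder for what the Bi-LSTM-CNN branch would produce.

```python
import math
import random

# Assumed label set for illustration; the abstract does not list the classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def fuse(text_feat, acoustic_feat):
    """Feature-level fusion by concatenation (an assumed fusion rule)."""
    return list(text_feat) + list(acoustic_feat)


def conv1d_relu(x, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    k = len(kernel)
    return [
        max(0.0, sum(x[i + j] * kernel[j] for j in range(k)) + bias)
        for i in range(len(x) - k + 1)
    ]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_cnn(fused, kernels, dense):
    """One conv layer, global max pooling per kernel, then a dense softmax layer."""
    pooled = [max(conv1d_relu(fused, k)) for k in kernels]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    return softmax(logits)


def make_random_model(n_kernels=3, kernel_size=3, n_classes=len(EMOTIONS), seed=0):
    """Random placeholder weights; a real model would learn these by training."""
    rng = random.Random(seed)
    kernels = [[rng.uniform(-1, 1) for _ in range(kernel_size)] for _ in range(n_kernels)]
    dense = [[rng.uniform(-1, 1) for _ in range(n_kernels)] for _ in range(n_classes)]
    return kernels, dense


# Placeholder vectors standing in for the Bi-LSTM-CNN text features
# and the acoustic features of one utterance.
text_feat = [0.2, -0.1, 0.7, 0.4]
acoustic_feat = [0.5, 0.3, -0.6]

kernels, dense = make_random_model()
probs = joint_cnn(fuse(text_feat, acoustic_feat), kernels, dense)
predicted = EMOTIONS[probs.index(max(probs))]
```

Because the weights are random, `predicted` carries no meaning here; the sketch only shows how a fused vector would flow through convolution, pooling, and a softmax over emotion classes.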
