Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (23): 163-170. DOI: 10.3778/j.issn.1002-8331.2104-0306

• Pattern Recognition and Artificial Intelligence •

Research on Multi-modal Emotion Recognition Based on Voice and Video Images

WANG Chuanyu, LI Weixiang, CHEN Zhenhuan   

  1. College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing 211816, China
  • Online: 2021-12-01 Published: 2021-12-02





Emotion recognition is an important research field in artificial intelligence; it infers emotion categories by analyzing physiological signals and behavioral characteristics. To improve recognition accuracy, a multi-modal emotion recognition method based on voice and video images is proposed. The video image modality is implemented with the Local Binary Patterns Histograms (LBPH) method, a Sparse Auto-Encoder (SAE), and an improved Convolutional Neural Network (CNN). The voice modality is implemented with an improved Deep-restricted Boltzmann Machine (DBM) and an improved Long Short-Term Memory (LSTM) network. The SAE extracts more detailed image features, and the DBM provides a deep representation of sound characteristics. The Back Propagation (BP) method is used to optimize the nonlinear mapping capability of the DBM and LSTM, and Global Average Pooling (GAP) is used to improve the response speed of the CNN and LSTM and to prevent overfitting. After each modality is recognized separately, the results of the two modalities are fused at the decision level according to a weight criterion, yielding the probabilities of the different emotion types. Experimental results show that, compared with traditional single-modal emotion recognition, the proposed method improves recognition accuracy and achieves a recognition rate of 74.9% on the test set of the Chinese Natural Audio-Visual Emotion Database (CHEAVD) 2.0. It can also be used for real-time emotion analysis.
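
The decision-level fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the weight criterion, so the code below assumes the simplest form: a convex combination of the two modalities' per-class probabilities, followed by renormalization. The function names (`fuse_decisions`, `predict`), the emotion labels, the probability values, and the weight `w_video` are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def fuse_decisions(
    p_video: Dict[str, float],
    p_audio: Dict[str, float],
    w_video: float = 0.5,
) -> Dict[str, float]:
    """Fuse two per-class probability distributions at the decision level.

    Assumption: the fusion rule is a weighted sum with weights w_video and
    (1 - w_video). The paper's actual weight criterion may differ.
    """
    if not 0.0 <= w_video <= 1.0:
        raise ValueError("w_video must lie in [0, 1]")
    if p_video.keys() != p_audio.keys():
        raise ValueError("both modalities must score the same emotion labels")

    w_audio = 1.0 - w_video
    fused = {
        label: w_video * p_video[label] + w_audio * p_audio[label]
        for label in p_video
    }

    # Renormalize so the fused scores form a probability distribution even
    # if the inputs were not perfectly normalized.
    total = sum(fused.values())
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return {label: score / total for label, score in fused.items()}


def predict(fused: Dict[str, float]) -> str:
    """Return the emotion label with the highest fused probability."""
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Hypothetical single-modality outputs for one sample.
    video_probs = {"happy": 0.6, "sad": 0.3, "angry": 0.1}
    audio_probs = {"happy": 0.2, "sad": 0.7, "angry": 0.1}

    fused_probs = fuse_decisions(video_probs, audio_probs, w_video=0.6)
    print(fused_probs, predict(fused_probs))
```

In a sketch like this, the weight would typically be chosen on a validation set, for example in proportion to each modality's standalone accuracy; whether the authors do so is not stated in the abstract.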

Key words: feature fusion, multimodal fusion, emotion recognition, speech emotion recognition, deep learning


