Computer Engineering and Applications (计算机工程与应用) ›› 2021, Vol. 57 ›› Issue (23): 163-170. DOI: 10.3778/j.issn.1002-8331.2104-0306

• Pattern Recognition and Artificial Intelligence •


Research on Multi-modal Emotion Recognition Based on Voice and Video Images

WANG Chuanyu (王传昱), LI Weixiang (李为相), CHEN Zhenhuan (陈震环)

  1. College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing 211816, China
  • Online: 2021-12-01  Published: 2021-12-02


Abstract:

Emotion recognition, which infers emotion categories from physiological signals and behavioral characteristics, is an important research field of artificial intelligence. To improve both the accuracy and the real-time performance of emotion recognition, a multi-modal method based on voice and video images is proposed. The video image modality is realized with the Local Binary Patterns Histograms (LBPH) method, a Sparse Auto-Encoder (SAE), and an improved Convolutional Neural Network (CNN); the voice modality is realized with an improved Deep-Restricted Boltzmann Machine (DBM) and an improved Long Short-Term Memory (LSTM) network. The SAE extracts finer-grained image features, and the DBM learns a deep representation of the acoustic features. The Back Propagation (BP) algorithm is used to optimize the nonlinear mapping capability of the DBM and the LSTM, and Global Average Pooling (GAP) is used to improve the response speed of the CNN and the LSTM and to prevent overfitting. After single-modality recognition, the results of the two modalities are fused at the decision level according to a weight criterion, yielding the predicted emotion class and its probability. Experimental results show that, compared with traditional single-modal emotion recognition, the proposed fusion strategy improves recognition accuracy, reaching a recognition rate of 74.9% on the test set of the Chinese Natural Audio-Visual Emotion Database (CHEAVD) 2.0, and it supports real-time analysis of the user's emotion.
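To make the video front-end concrete, the following is a minimal sketch of the LBPH feature extraction step described in the abstract, written in Python with scikit-image. The grid size, radius, and neighbour count are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal LBPH sketch: per-cell uniform LBP histograms concatenated into one
# feature vector. Grid/radius/neighbour values are illustrative assumptions.
import numpy as np
from skimage.feature import local_binary_pattern

def lbph_features(gray_face, grid=(8, 8), radius=1, n_points=8):
    """Split a grayscale face image into a grid of cells and concatenate
    the per-cell LBP histograms into one feature vector."""
    lbp = local_binary_pattern(gray_face, n_points, radius, method="uniform")
    n_bins = n_points + 2  # uniform LBP codes take values 0 .. n_points+1
    h, w = lbp.shape
    ch, cw = h // grid[0], w // grid[1]
    hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins),
                                   density=True)
            hists.append(hist)
    return np.concatenate(hists)  # e.g. 8 * 8 * 10 = 640-dim vector
```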

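The abstract's use of Global Average Pooling to curb overfitting and speed up inference can be illustrated with a small PyTorch classification head in which GAP replaces the usual fully connected layers; the channel and class counts below are assumptions for illustration, not the paper's architecture.

```python
# Sketch of a GAP head: a 1x1 convolution produces one feature map per class,
# and global average pooling reduces each map to a single logit, so the head
# has no dense-layer parameters to overfit. Sizes are illustrative.
import torch
import torch.nn as nn

class GapHead(nn.Module):
    def __init__(self, in_channels=128, n_classes=8):
        super().__init__()
        self.score = nn.Conv2d(in_channels, n_classes, kernel_size=1)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                 # x: (N, C, H, W)
        x = self.gap(self.score(x))       # (N, n_classes, 1, 1)
        return x.flatten(1)               # logits, (N, n_classes)

# Example: logits for a batch of two 7x7 feature maps with 128 channels.
logits = GapHead()(torch.randn(2, 128, 7, 7))
```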
Key words: feature fusion, multimodal fusion, facial expression recognition, speech emotion recognition, deep learning
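Finally, a minimal sketch of the decision-level fusion step, assuming each modality outputs a softmax distribution over the same emotion classes. The label set and the weights w_face / w_voice are illustrative stand-ins for the paper's weight criterion, not published values.

```python
# Sketch of decision-level fusion: a weighted sum of the two per-modality
# probability vectors, renormalized, returning the top class and its
# probability. Weights and label set are assumptions for illustration.
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "worried",
            "anxious", "surprise", "disgust", "neutral"]  # assumed label set

def fuse(p_face, p_voice, w_face=0.6, w_voice=0.4):
    """Fuse two modality-level softmax outputs at the decision level."""
    fused = w_face * np.asarray(p_face) + w_voice * np.asarray(p_voice)
    fused /= fused.sum()
    k = int(np.argmax(fused))
    return EMOTIONS[k], float(fused[k])

# Example: fuse(cnn_softmax_output, lstm_softmax_output) -> ("happy", 0.81)
```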