Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (6): 140-146. DOI: 10.3778/j.issn.1002-8331.1811-0332

• Pattern Recognition and Artificial Intelligence •


Research on Audio-Visual Dual-Modal Emotion Recognition Fusion Framework

SONG Guanjun (宋冠军), ZHANG Shudong (张树东), WEI Feigao (卫飞高)

  1. College of Information Engineering, Capital Normal University, Beijing 100048, China
  • Online: 2020-03-15    Published: 2020-03-13


Abstract:

Aiming at the low recognition rate and poor reliability of dual-modal emotion recognition frameworks, this paper studies feature-level fusion of speech and facial expression, the two most important modalities for emotion recognition. A feature extraction method based on prior knowledge and the VGGNet-19 network are used to extract features from the pre-processed audio and video signals, respectively. Feature fusion is achieved by direct concatenation followed by dimensionality reduction through PCA, and a BLSTM network is used to build the model that completes emotion recognition. The framework is tested on the AViD-Corpus and SEMAINE databases and compared with the traditional feature-level fusion framework for emotion recognition, as well as with frameworks based on VGGNet-19 or BLSTM alone. The experimental results show that the Root Mean Square Error (RMSE) of emotion recognition is reduced and the Pearson Correlation Coefficient (PCC) is improved, which verifies the effectiveness of the proposed method.
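As a concrete reference for the pipeline described above, the following is a minimal sketch, not the authors' implementation: it assumes pre-extracted, temporally aligned per-frame audio features (prior-knowledge descriptors) and VGGNet-19 visual features, and it uses scikit-learn's PCA and a PyTorch bidirectional LSTM. All dimensions and hyper-parameters are illustrative assumptions.

```python
# Sketch of feature-level fusion: concatenate per-frame audio and visual
# features, reduce the joint vector with PCA, then regress a continuous
# emotion value per frame with a BLSTM. Sizes below are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

T, D_AUDIO, D_VISUAL, D_PCA = 500, 384, 4096, 128  # assumed sizes

# Stand-ins for pre-extracted sequences: prior-knowledge audio features
# and VGGNet-19 activations, one vector per aligned frame.
audio_feats = np.random.randn(T, D_AUDIO).astype(np.float32)
visual_feats = np.random.randn(T, D_VISUAL).astype(np.float32)

# Feature-level fusion: direct concatenation followed by PCA.
fused = np.concatenate([audio_feats, visual_feats], axis=1)  # (T, 4480)
fused = PCA(n_components=D_PCA).fit_transform(fused)         # (T, 128)

class BLSTMRegressor(nn.Module):
    """Bidirectional LSTM mapping the fused sequence to one continuous
    emotion value (e.g. arousal or valence) per frame."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # 2x: both directions

    def forward(self, x):                  # x: (batch, T, in_dim)
        out, _ = self.blstm(x)             # (batch, T, 2*hidden)
        return self.head(out).squeeze(-1)  # (batch, T)

model = BLSTMRegressor(D_PCA)
x = torch.from_numpy(fused).float().unsqueeze(0)  # (1, T, 128)
pred = model(x)                                   # per-frame prediction
```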

Key words: audio-visual, dual-modal, feature-level fusion, emotion recognition, BLSTM
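The abstract evaluates with RMSE and PCC; for reference, a minimal sketch of both metrics over frame-level predictions (the array contents are illustrative, not data from the paper):

```python
# Root Mean Square Error (lower is better) and Pearson Correlation
# Coefficient (higher is better) between predicted and gold labels.
import numpy as np

def rmse(pred, gold):
    return float(np.sqrt(np.mean((pred - gold) ** 2)))

def pcc(pred, gold):
    # Pearson's r: covariance normalised by both standard deviations.
    return float(np.corrcoef(pred, gold)[0, 1])

pred = np.array([0.2, 0.5, 0.1, 0.8])
gold = np.array([0.3, 0.4, 0.2, 0.9])
print(rmse(pred, gold), pcc(pred, gold))
```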