Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 170-177. DOI: 10.3778/j.issn.1002-8331.2107-0249

• Pattern Recognition and Artificial Intelligence •

Speech Emotion Recognition Based on Dual-Channel Convolutional Gated Recurrent Network

SUN Hanyu, HUANG Lixia, ZHANG Xueying, LI Juan   

  1. College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
  • Online:2023-01-15 Published:2023-01-15

Abstract: To build an efficient speech emotion recognition model and make full use of the information carried by different emotion features, a dual-channel convolutional gated recurrent network (CGRU) model based on the self-attention mechanism is constructed, which takes spectrogram features and LLDs features as input. Meanwhile, to address the problem that the cross-entropy loss function cannot increase the intra-class compactness and inter-class separability of speech emotion features, a new loss function, the concordance correlation loss (CCC-Loss), is proposed based on the concordance correlation coefficient. First, the spectrogram and LLDs features are fed into separate CGRU channels to extract deep features, and the self-attention mechanism assigns higher weights to key moments. Then, the model is trained jointly with CCC-Loss and cross-entropy loss. CCC-Loss takes as its loss term the ratio of the sum of concordance correlation coefficients between samples of different emotion classes to the sum of concordance correlation coefficients between samples of the same class, which improves the intra-class and inter-class correlation of sample features and strengthens the model's feature discrimination ability. Finally, the classification results of the two networks are combined by decision-level fusion. The proposed method achieves recognition accuracies of 92.90%, 88.54% and 90.58% on the EMODB, RAVDESS and CASIA databases respectively, outperforming baseline models such as ACRNN and DSCNN.
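As an illustration of the loss described above, the following is a minimal PyTorch sketch of a batch-wise CCC-Loss, assuming the concordance correlation coefficient is computed between pairs of deep feature vectors within a mini-batch. The pairing scheme, the epsilon smoothing, and the way the term is weighted against cross-entropy are assumptions for the sketch, not the paper's exact formulation.

import torch

def concordance_cc(x, y, eps=1e-8):
    # Lin's concordance correlation coefficient between two 1-D feature vectors.
    mx, my = x.mean(), y.mean()
    vx = ((x - mx) ** 2).mean()
    vy = ((y - my) ** 2).mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2 + eps)

def ccc_loss(features, labels, eps=1e-8):
    # Loss term: sum of CCCs over different-class pairs divided by the sum of
    # CCCs over same-class pairs, computed within a mini-batch. Minimizing it
    # pushes same-class features toward high concordance and different-class
    # features toward low concordance.
    n = features.size(0)
    same = features.new_tensor(eps)
    diff = features.new_tensor(eps)
    for i in range(n):
        for j in range(i + 1, n):
            c = concordance_cc(features[i], features[j], eps)
            if labels[i] == labels[j]:
                same = same + c
            else:
                diff = diff + c
    return diff / same

In joint training this term would be added to the cross-entropy loss, e.g. total = ce + lam * ccc_loss(feat, y), where lam is an assumed trade-off weight not specified in the abstract.

The dual-channel structure itself can be pictured roughly as follows: each channel is a small CNN front-end followed by a bidirectional GRU and a simple frame-level self-attention pooling, and the posteriors of the spectrogram and LLDs channels are fused at decision level. Layer sizes, the attention form, and the fusion rule (plain averaging) are illustrative assumptions rather than the configuration reported in the paper.

import torch
import torch.nn as nn

class CGRUChannel(nn.Module):
    # One channel: 1-D CNN front-end, bidirectional GRU, frame-level
    # self-attention pooling, and a linear classifier.
    def __init__(self, in_dim, num_classes, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
        )
        self.gru = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # attention score per frame
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                             # x: (batch, time, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.gru(h)                            # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)        # higher weight for key frames
        pooled = (w * h).sum(dim=1)                   # attention-weighted utterance feature
        return pooled, self.fc(pooled)

# Decision-level fusion of the two channels, e.g. by averaging posteriors:
#   probs = 0.5 * spec_logits.softmax(-1) + 0.5 * lld_logits.softmax(-1)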

Key words: speech emotion recognition, convolutional neural network, gated recurrent unit, self-attention, loss function, deep learning, concordance correlation coefficient