Speech Emotion Recognition Based on Dual-Channel Convolutional Gated Recurrent Network

doi:10.3778/j.issn.1002-8331.2107-0249

Abstract

To build an efficient speech emotion recognition model that makes full use of the information carried by different emotion features, a dual-channel convolutional gated recurrent network based on the self-attention mechanism is constructed, taking spectrogram features and low-level descriptor (LLD) features as its two inputs. In addition, because the cross-entropy loss function cannot increase the intra-class compactness and inter-class separation of speech emotion features, a new loss function, the concordance correlation loss (CCC-Loss), is proposed based on the concordance correlation coefficient. First, the spectrogram and LLD features are fed separately into a CGRU model to extract deep features, and a self-attention mechanism assigns higher weights to key moments. Then, the model is trained jointly with CCC-Loss and cross-entropy loss. CCC-Loss takes as its loss term the ratio of the sum of concordance correlation coefficients between samples of different emotion classes to the sum between samples of the same emotion class, which improves the intra-class and inter-class correlation of sample features and strengthens the model's feature discrimination ability. Finally, the classification results of the two networks are combined by decision-level fusion. The proposed method achieves recognition results of 92.90%, 88.54% and 90.58% on the EMODB, RAVDESS and CASIA databases, respectively, outperforming baseline models such as ACRNN and DSCNN.

Key words: speech emotion recognition, convolutional neural network, gated recurrent unit, self-attention mechanism, loss function, deep learning, concordance correlation coefficient
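
The loss term and the fusion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives only the ratio form of CCC-Loss, so the pairwise summation over a batch, the small stabilizer `eps`, the weight `lam` in the joint loss, and the probability-averaging fusion rule are all assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def ccc(x, y, eps=1e-8):
    """Lin's concordance correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # eps is an assumed stabilizer for constant, identical vectors.
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2 + eps)


def ccc_loss_term(features, labels):
    """Summed CCC over different-class pairs divided by summed CCC over same-class pairs."""
    inter, intra = 0.0, 0.0
    for i, j in combinations(range(len(features)), 2):
        value = ccc(features[i], features[j])
        if labels[i] == labels[j]:
            intra += value
        else:
            inter += value
    # Assumes the same-class sum is non-zero; the abstract does not say
    # how degenerate batches are handled.
    return inter / intra


def joint_loss(cross_entropy, ccc_term, lam=1.0):
    """Assumed combination: cross-entropy plus a weighted CCC term (lam is hypothetical)."""
    return cross_entropy + lam * ccc_term


def fuse_decisions(probs_a, probs_b):
    """Assumed fusion rule: average both channels' class probabilities, pick the largest."""
    fused = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    return max(range(len(fused)), key=fused.__getitem__)


# Toy batch: two samples per class, nearly identical within a class.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 2.9, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [4.0, 3.1, 2.0, 0.9],
]
labels = [0, 0, 1, 1]
term = ccc_loss_term(features, labels)
predicted = fuse_decisions([0.6, 0.4], [0.1, 0.9])
```

In this toy batch the same-class pairs are highly concordant and the different-class pairs are anti-concordant, so the ratio is small. Minimizing it pushes features of the same emotion together and features of different emotions apart, which is the effect the abstract attributes to CCC-Loss.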

SUN Hanyu, HUANG Lixia, ZHANG Xueying, LI Juan. Speech Emotion Recognition Based on Dual-Channel Convolutional Gated Recurrent Network[J]. Computer Engineering and Applications, 2023, 59(2): 170-177.
