Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (23): 163-170. DOI: 10.3778/j.issn.1002-8331.2104-0306

• Pattern Recognition and Artificial Intelligence •

Research on Multi-modal Emotion Recognition Based on Voice and Video Images

WANG Chuanyu, LI Weixiang, CHEN Zhenhuan   

  1. College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing 211816, China
  • Online: 2021-12-01  Published: 2021-12-02

Abstract:

Emotion recognition, one of the important research fields of artificial intelligence, analyzes physiological signals and behavioral characteristics to determine emotion categories. To improve the accuracy and real-time performance of emotion recognition, a multi-modal emotion recognition method based on voice and video images is proposed. The video image modality is realized with the Local Binary Patterns Histograms (LBPH) method, a Sparse Auto-Encoder (SAE), and an improved Convolutional Neural Network (CNN); the voice modality is realized with an improved Deep Restricted Boltzmann Machine (DBM) and an improved Long Short-Term Memory (LSTM) network. The SAE captures more detailed image features, while the DBM yields a deep representation of acoustic features. Back Propagation (BP) is used to optimize the nonlinear mapping capability of the DBM and the LSTM, and Global Average Pooling (GAP) is used to improve the response speed of the CNN and the LSTM and to prevent overfitting. After single-modality recognition, the results of the two modalities are fused at the decision level according to a weight criterion, and the probabilities of the different emotion categories are given. Experimental results show that, compared with traditional single-modal emotion recognition, the proposed method improves recognition accuracy, achieving a recognition rate of 74.9% on the test set of the Chinese Natural Audio-Visual Emotion Database (CHEAVD) 2.0, and it can also analyze emotions in real time.
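To make the fusion step concrete, the following is a minimal Python sketch of weight-criterion decision-level fusion, assuming each modality outputs a probability distribution over the same emotion classes. The label set, the example probabilities, and the weights w_video/w_voice are hypothetical placeholders for illustration, not values from the paper.

    import numpy as np

    # Hypothetical label set; the abstract does not list the emotion classes.
    EMOTIONS = ["angry", "happy", "sad", "neutral", "surprise", "disgust"]

    def fuse_decisions(p_video, p_voice, w_video=0.6, w_voice=0.4):
        """Weighted decision-level fusion of two per-modality probability vectors.

        The weights are illustrative placeholders, not values from the paper.
        """
        p_video = np.asarray(p_video, dtype=float)
        p_voice = np.asarray(p_voice, dtype=float)
        fused = w_video * p_video + w_voice * p_voice
        fused /= fused.sum()  # renormalize into a probability distribution
        return dict(zip(EMOTIONS, fused.tolist()))

    # Example: the video modality leans "happy", the voice modality leans "neutral".
    p_video = [0.05, 0.55, 0.05, 0.20, 0.10, 0.05]
    p_voice = [0.10, 0.25, 0.10, 0.45, 0.05, 0.05]
    scores = fuse_decisions(p_video, p_voice)
    print(max(scores, key=scores.get))  # "happy" wins under these weights
    print(scores)

In practice the per-modality weights would be tuned on a validation set; renormalizing the fused vector keeps the outputs interpretable as the per-class probabilities the abstract describes.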

Key words: feature fusion, multimodal fusion, emotion recognition, speech emotion recognition, deep learning
