Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 141-147.DOI: 10.3778/j.issn.1002-8331.2203-0282

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Video Emotion Analysis Method Based on Multi-Task Learning

MIAO Yuqing, DONG Han, ZHANG Wanzhen, ZHOU Ming, CAI Guoyong, DU Huawei   

  1. School of Computer Science & Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    2. Guangxi Key Laboratory of Image & Graphics Intelligent Processing, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    3. Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin, Guangxi 541004, China
    4. College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
    5. Guilin Hivision Technology Company, Guilin, Guangxi 541004, China
    6. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  • Online: 2023-06-15    Published: 2023-06-15

Abstract: To address the problems of insufficient modal fusion, high spatial complexity, and limited consideration of the speaker's own attributes in existing cross-modal video emotion analysis models, this paper proposes a video emotion analysis model that combines multi-head attention with multi-task learning. First, the video is preprocessed to obtain feature representations for three modalities: video, audio, and text. Second, each feature representation is fed into a GRU network to extract temporal features. Next, the proposed max-pooling multi-head attention mechanism fuses text with video and text with audio in a pairwise manner. Finally, the fused features are fed into a multi-task network for emotion classification and gender classification, yielding the speaker's emotion polarity and gender. Experimental results show that the proposed model makes better use of the difference information between modalities and of the speaker's gender attribute, effectively improving emotion recognition accuracy while reducing the spatial complexity of the model.

Key words: video emotion analysis, modal fusion, multi-head attention, multi-task learning, model complexity
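
The following is a minimal, illustrative PyTorch sketch of the pipeline described in the abstract, not the authors' implementation. Feature dimensions, module names, and the fusion details are assumptions; in particular, the proposed max-pooling multi-head attention is approximated here by standard multi-head attention followed by max pooling over time, which may differ from the mechanism in the paper.

```python
# Illustrative sketch only (not the authors' code): unimodal GRU encoders,
# pairwise text-video and text-audio attention fusion, and two task heads
# (emotion and gender) sharing the fused representation.
import torch
import torch.nn as nn

class MultiTaskCrossModalModel(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, video_dim=35,
                 hidden=128, heads=4, num_emotions=2):
        super().__init__()
        # GRU encoders extract temporal features for each modality.
        self.text_gru = nn.GRU(text_dim, hidden, batch_first=True)
        self.audio_gru = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_gru = nn.GRU(video_dim, hidden, batch_first=True)
        # Pairwise fusion: text attends to video, and text attends to audio.
        self.tv_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ta_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Multi-task heads over the shared fused representation.
        self.emotion_head = nn.Linear(2 * hidden, num_emotions)
        self.gender_head = nn.Linear(2 * hidden, 2)

    def forward(self, text, audio, video):
        t, _ = self.text_gru(text)     # (B, Tt, H)
        a, _ = self.audio_gru(audio)   # (B, Ta, H)
        v, _ = self.video_gru(video)   # (B, Tv, H)
        # Cross-modal attention, then max pooling over time (assumed here
        # as a stand-in for the paper's max-pooling multi-head attention).
        tv, _ = self.tv_attn(t, v, v)  # (B, Tt, H)
        ta, _ = self.ta_attn(t, a, a)  # (B, Tt, H)
        fused = torch.cat([tv.max(dim=1).values,
                           ta.max(dim=1).values], dim=-1)  # (B, 2H)
        return self.emotion_head(fused), self.gender_head(fused)

# Usage sketch with random tensors standing in for extracted features.
model = MultiTaskCrossModalModel()
emotion_logits, gender_logits = model(torch.randn(8, 20, 300),
                                       torch.randn(8, 40, 74),
                                       torch.randn(8, 30, 35))
```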
