Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 141-147. DOI: 10.3778/j.issn.1002-8331.2203-0282

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Video Emotion Analysis Method Based on Multi-Task Learning

MIAO Yuqing, DONG Han, ZHANG Wanzhen, ZHOU Ming, CAI Guoyong, DU Huawei   

  1. School of Computer Science & Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    2. Guangxi Key Laboratory of Image & Graphics Intelligent Processing, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    3. Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, Guilin, Guangxi 541004, China
    4. College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
    5. Guilin Hivision Technology Company, Guilin, Guangxi 541004, China
    6. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  • Online:2023-06-15 Published:2023-06-15

Abstract: To address the problems of insufficient modal fusion, high space complexity, and limited consideration of the speaker's own attributes in existing cross-modal video emotion analysis models, this paper proposes a cross-modal video emotion analysis model that combines multi-head attention with multi-task learning. First, the video is preprocessed to obtain feature representations for the video, audio, and text modalities. These representations are then fed into GRU networks to extract temporal features. Next, the proposed max-pooling multi-head attention mechanism performs pairwise fusion of text with video and of text with audio. Finally, the fused features are passed to a multi-task network for emotion classification and gender classification, which outputs the speaker's emotional polarity and gender. Experimental results show that the proposed model makes better use of the difference information between modalities and of the speaker's gender attribute, effectively improving emotion recognition accuracy while reducing the space complexity of the model.
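
The pipeline described in the abstract (per-modality GRUs, max-pooling multi-head attention fusion of text with video and text with audio, and a shared multi-task head for emotion and gender) can be illustrated with a minimal PyTorch sketch. All class names, hyperparameters, and feature dimensions below are illustrative assumptions, and the fusion module here simply max-pools standard multi-head attention output over time, which may differ from the paper's exact design.

import torch
import torch.nn as nn

class MaxPoolMultiHeadFusion(nn.Module):
    # Cross-modal fusion: text features attend to another modality via
    # multi-head attention, then a max-pool over time yields one vector.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, other):
        # text: (B, T_text, D) as queries; other: (B, T_other, D) as keys/values
        fused, _ = self.attn(text, other, other)
        return fused.max(dim=1).values  # (B, D)

class MultiTaskEmotionModel(nn.Module):
    def __init__(self, d_text, d_audio, d_video, hidden=128,
                 n_emotions=2, n_genders=2):
        super().__init__()
        # One GRU per modality extracts temporal features.
        self.gru_t = nn.GRU(d_text, hidden, batch_first=True)
        self.gru_a = nn.GRU(d_audio, hidden, batch_first=True)
        self.gru_v = nn.GRU(d_video, hidden, batch_first=True)
        # Pairwise fusion: text-video and text-audio.
        self.fuse_tv = MaxPoolMultiHeadFusion(hidden)
        self.fuse_ta = MaxPoolMultiHeadFusion(hidden)
        # The shared fused representation feeds two task-specific heads.
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)
        self.gender_head = nn.Linear(2 * hidden, n_genders)

    def forward(self, text, audio, video):
        h_t, _ = self.gru_t(text)   # (B, T, hidden)
        h_a, _ = self.gru_a(audio)
        h_v, _ = self.gru_v(video)
        shared = torch.cat([self.fuse_tv(h_t, h_v),
                            self.fuse_ta(h_t, h_a)], dim=-1)
        return self.emotion_head(shared), self.gender_head(shared)

# Toy usage: random tensors stand in for preprocessed modality features;
# the feature dimensions below are placeholders, not the paper's values.
model = MultiTaskEmotionModel(d_text=300, d_audio=74, d_video=35)
emotion_logits, gender_logits = model(torch.randn(8, 20, 300),
                                      torch.randn(8, 20, 74),
                                      torch.randn(8, 20, 35))
# Joint training would sum a cross-entropy loss for each head.

In this reading, max-pooling the attended sequence into a single fixed-size vector is what keeps the fused representation compact, which is one plausible way the reported reduction in space complexity could arise.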

Key words: video emotion analysis, modal fusion, multi-head attention, multi-task learning, model complexity