Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (14): 219-226. DOI: 10.3778/j.issn.1002-8331.2012-0122

• Graphics and Image Processing •

Video Memorability Prediction Based on Multi-Modal Features Fusion

CHANG Shiying, HU Yan

  1. School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China
  • Online: 2022-07-15 Published: 2022-07-15

Abstract: With the explosive growth of online video, video memorability has become an active research topic. Video memorability measures how memorable a video is, and computational models that predict it automatically have a wide range of applications. Most existing work on video memorability prediction focuses on common visual features or semantic factors and does not consider the influence of depth features. This paper focuses on the depth features of video. After preprocessing, an existing depth estimation model is used to extract depth maps, and the original video frames together with the depth maps are fed into a pre-trained ResNet152 network to extract depth features. The TF-IDF algorithm is used to extract semantic features, assigning different weights to words that affect video memorability. Finally, the depth features, the semantic features, and C3D spatiotemporal features extracted from the video content are combined by late fusion, yielding a multi-modal video memorability prediction model. Experiments on VideoMem, the large public dataset provided by MediaEval 2019, achieve a Spearman's rank correlation of 0.545 on the short-term memorability prediction task and 0.240 on the long-term task, which demonstrates the effectiveness of the model.

Key words: video memorability, multi-modal, feature fusion
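
The late-fusion and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so a weighted average of per-modality predictions is assumed here. The function names (`rank`, `spearman`, `late_fusion`), the weights, and all scores are hypothetical toy values, not the authors' implementation or data.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def late_fusion(scores_by_modality, weights):
    """Weighted average of per-modality predictions (assumed fusion rule)."""
    total = sum(weights.values())
    n = len(next(iter(scores_by_modality.values())))
    return [
        sum(weights[m] * scores_by_modality[m][i] for m in weights) / total
        for i in range(n)
    ]


# Hypothetical per-video memorability predictions from three models,
# one per modality named in the abstract.
scores = {
    "depth": [0.80, 0.60, 0.90, 0.70],
    "semantic": [0.75, 0.65, 0.85, 0.60],
    "c3d": [0.70, 0.55, 0.95, 0.80],
}
weights = {"depth": 0.4, "semantic": 0.3, "c3d": 0.3}  # hypothetical weights
ground_truth = [0.82, 0.58, 0.93, 0.71]  # hypothetical annotated scores

fused = late_fusion(scores, weights)
rho = spearman(fused, ground_truth)
print(fused, rho)
```

In practice the fusion weights would be tuned on a validation split, and a library routine such as `scipy.stats.spearmanr` could replace the hand-written correlation.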