Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (18): 158-166. DOI: 10.3778/j.issn.1002-8331.2312-0033

• Theory, Research and Development •

Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation

WANG Lulu, XU Zengmin, ZHANG Xuelian, MENG Ruxing, LU Tao   

  1. Guangxi Colleges and Universities Key Laboratory of Data Analysis and Computation, School of Mathematics and Computing Science, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    2. Center for Applied Mathematics of Guangxi (GUET), Guilin, Guangxi 541004, China
    3. Anview.ai, Guilin, Guangxi 541010, China
    4. Hubei Key Laboratory of Intelligent Robot, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
  • Online: 2024-09-15  Published: 2024-09-13

Abstract: Existing self-supervised representation algorithms focus mainly on short-term motion characteristics between video frames, but the variation across frame-level action sequences is small, and the limited semantics of single-view data constrain the expressive power of deep features, leaving the rich multi-view information in video actions underexploited. To address this, a temporal contrastive learning algorithm based on cross-view semantic consistency is proposed, which learns in a self-supervised manner the temporal action dynamics embedded in both RGB frames and optical flow fields. The main ideas are as follows: a local temporal contrastive learning method is designed, which adopts different positive/negative sample partition strategies to mine the temporal correlation and discriminability between non-overlapping clips of the same instance, enhancing fine-grained feature expressiveness; a global contrastive learning method is studied, which enlarges the set of positive samples through cross-view semantic co-training and learns the semantic consistency of different views across multiple instances, improving the generalization ability of the model. The model is evaluated on two downstream tasks, and experimental results on the UCF101 and HMDB51 datasets show that the proposed method outperforms state-of-the-art methods by 2 to 3.5 percentage points on average in action recognition and video retrieval.

Key words: self-supervised learning, video representation learning, temporal contrastive learning, local contrastive learning, cross-view co-training
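
The abstract describes two contrastive objectives (local temporal and global cross-view) but gives no implementation details. As a rough illustration, the sketch below shows how such instance-level losses are commonly written in PyTorch; the InfoNCE formulation, the `info_nce` helper, the temperature of 0.1, and the use of in-batch negatives are assumptions made for illustration, not the paper's actual loss or sample-partition strategy.

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.1):
    """InfoNCE loss: row i of `query` and row i of `key` form the positive
    pair; every other row in the batch acts as a negative."""
    q = F.normalize(query, dim=1)               # L2-normalize embeddings
    k = F.normalize(key, dim=1)
    logits = q @ k.t() / temperature            # (B, B) cosine-similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in embeddings (batch of 8 videos, 128-d features).
B, D = 8, 128
z_clip_a, z_clip_b = torch.randn(B, D), torch.randn(B, D)  # two non-overlapping clips per video
z_rgb, z_flow = torch.randn(B, D), torch.randn(B, D)       # RGB and optical-flow views per video

# Local temporal term: non-overlapping clips of the same instance attract.
# Global cross-view term: RGB and flow embeddings of the same video attract.
loss = info_nce(z_clip_a, z_clip_b) + info_nce(z_rgb, z_flow)
```

Treating all other batch samples as negatives keeps the sketch self-contained; the paper's own positive/negative partition strategies and cross-view co-training procedure would replace this simple diagonal pairing.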