Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (18): 158-166. DOI: 10.3778/j.issn.1002-8331.2312-0033

• Theory, Research and Development •

Cross-View Temporal Contrastive Learning for Self-Supervised Video Representation

WANG Lulu, XU Zengmin, ZHANG Xuelian, MENG Ruxing, LU Tao   

  1. Guangxi Colleges and Universities Key Laboratory of Data Analysis and Computation, School of Mathematics and Computing Science, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
    2. Center for Applied Mathematics of Guangxi (GUET), Guilin, Guangxi 541004, China
    3. Anview.ai, Guilin, Guangxi 541010, China
    4. Hubei Key Laboratory of Intelligent Robot, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
  • Online: 2024-09-15  Published: 2024-09-13

Abstract: Existing self-supervised video representation algorithms focus mainly on short-term motion characteristics between video frames, yet inter-frame action sequences vary only slightly, single-view data carries limited semantics that weaken deep feature expression, and the rich multi-view information in video actions remains underexploited. To address this, a temporal contrastive learning algorithm based on cross-view semantic consistency is proposed, which learns in a self-supervised manner the temporal variation characteristics of actions embedded in both RGB frames and optical flow fields. The main ideas are as follows: a local temporal contrastive learning method is designed, which adopts different positive/negative sample partition strategies to mine the temporal correlation and discriminative separability between non-overlapping clips of the same instance and thereby strengthens fine-grained feature expression; a global contrastive learning method is studied, which enlarges the positive sample set through cross-view semantic co-training and learns the semantic consistency of different views across multiple instances, improving the generalization ability of the model. Model performance is evaluated on two downstream tasks; experimental results on the UCF101 and HMDB51 datasets show that the proposed method outperforms leading mainstream methods by 2 to 3.5 percentage points on average on action recognition and video retrieval.
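To make the two objectives above concrete, the following is a minimal PyTorch sketch of one common way such local and cross-view contrastive losses are instantiated. The abstract does not specify the exact loss form or the positive/negative partition strategies, so a standard InfoNCE (NT-Xent) formulation is assumed; the function names (nt_xent, local_temporal_loss, cross_view_loss) and tensor shapes are hypothetical and do not represent the authors' implementation.

import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, temperature=0.07):
    """NT-Xent / InfoNCE over a batch: row i of z_a and row i of z_b form a
    positive pair; every other row of z_b serves as a negative."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                 # (B, B) cosine similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

def local_temporal_loss(clip_1, clip_2, temperature=0.07):
    # Local temporal contrast (assumed reading): embeddings of two
    # non-overlapping clips of the same video are positives; clips of the
    # other videos in the batch act as negatives.
    return nt_xent(clip_1, clip_2, temperature)

def cross_view_loss(z_rgb, z_flow, temperature=0.07):
    # Global cross-view contrast (assumed reading): the RGB and optical-flow
    # embeddings of the same video form an additional positive pair,
    # symmetrized over the two views.
    return 0.5 * (nt_xent(z_rgb, z_flow, temperature)
                  + nt_xent(z_flow, z_rgb, temperature))

if __name__ == "__main__":
    B, D = 8, 128                                        # 8 videos, 128-d embeddings
    clip_1, clip_2 = torch.randn(B, D), torch.randn(B, D)  # two clips per video
    z_rgb, z_flow = torch.randn(B, D), torch.randn(B, D)   # two views per video
    loss = local_temporal_loss(clip_1, clip_2) + cross_view_loss(z_rgb, z_flow)
    print(f"total contrastive loss: {loss.item():.4f}")

In this sketch the local term takes its positive pair from within a single instance (two non-overlapping clips) while the cross-view term enlarges the positive set across views of multiple instances, mirroring the local/global split described in the abstract.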

Key words: self-supervised learning, video representation learning, temporal contrastive learning, local contrastive learning, cross-view co-training
