[1] WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//Proceedings of the European Conference on Computer Vision, Amsterdam, Oct 10-16, 2016. Berlin: Springer, 2016: 20-36.
[2] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3d convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. New York: IEEE Press, 2015: 4489-4497.
[3] ABU-EL-HAIJA S, KOTHARI N, LEE J, et al. Youtube-8m: a large-scale video classification benchmark[EB/OL].(2016-09-27)[2022-11-16]. https://arxiv.org/abs/1609.08675.
[4] KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset[EB/OL].(2017-05-19)[2022-11-16]. https://arxiv.org/abs/1705.06950.
[5] HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. New York: IEEE Press, 2020: 9729-9738.
[6] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]//Proceedings of the International Conference on Machine Learning, Jul 12-18, 2020. New York: PMLR, 2020: 1597-1607.
[7] GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: a new approach to self-supervised learning[C]//Advances in Neural Information Processing Systems, 2020: 21271-21284.
[8] CARON M, MISRA I, MAIRAL J, et al. Unsupervised learning of visual features by contrasting cluster assignments[C]//Advances in Neural Information Processing Systems, 2020: 9912-9924.
[9] CHEN X, HE K. Exploring simple siamese representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 19-25, 2021. New York: IEEE Press, 2021: 15750-15758.
[10] HAN T, XIE W, ZISSERMAN A. Self-supervised co-training for video representation learning[C]//Advances in Neural Information Processing Systems, 2020: 5679-5690.
[11] PAN T, SONG Y, YANG T, et al. Videomoco: contrastive video representation learning with temporally adversarial examples[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 19-25, 2021. New York: IEEE Press, 2021: 11205-11214.
[12] LIU Y, WANG K, LAN H, et al. Temporal contrastive graph learning for video action recognition and retrieval[EB/OL].(2021-03-17)[2022-11-05]. https://arxiv.org/abs/2101.00820.
[13] QIAN R, MENG T, GONG B, et al. Spatiotemporal contrastive video representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 19-25, 2021. New York: IEEE Press, 2021: 6964-6974.
[14] KUANG H, ZHU Y, ZHANG Z, et al. Video contrastive learning with global context[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Oct 10-17, 2021. New York: IEEE Press, 2021: 3195-3204.
[15] TAO L, WANG X, YAMASAKI T. Self-supervised video representation learning using inter-intra contrastive framework[C]//Proceedings of the 28th ACM International Conference on Multimedia, Seattle, Oct 12-16, 2020. New York: Association for Computing Machinery, 2020: 2193-2201.
[16] OORD A, LI Y, VINYALS O. Representation learning with contrastive predictive coding[EB/OL].(2019-01-22)[2022-11-05]. https://arxiv.org/abs/1807.03748.
[17] DORKENWALD M, XIAO F, BRATTOLI B, et al. SCVRL: shuffled contrastive video representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, Jun 19-21, 2022. New York: IEEE Press, 2022: 4132-4141.
[18] WANG J, JIAO J, LIU Y H. Self-supervised video representation learning by pace prediction[C]//Proceedings of the European Conference on Computer Vision, Aug 23-28, 2020. Cham: Springer, 2020: 504-521.
[19] SINGH A, CHAKRABORTY O, VARSHNEY A, et al. Semi-supervised action recognition with temporal contrastive learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 19-25, 2021. New York: IEEE Press, 2021: 10389-10399.
[20] DAVE I, GUPTA R, RIZVE M N, et al. TCLR: temporal contrastive learning for video representation[J]. Computer Vision and Image Understanding, 2022, 219: 103406-103414.
[21] JIAO J, CAI Y, ALSHARID M, et al. Self-supervised contrastive video-speech representation learning for ultrasound[C]//Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Lima, Oct 4-8, 2020. Cham: Springer, 2020: 534-543.
[22] XIAO F, TIGHE J, MODOLO D. MaCLR: motion-aware contrastive learning of representations for videos[C]//Proceedings of the European Conference on Computer Vision, Tel-Aviv, Oct 23-27, 2022. Cham: Springer, 2022: 353-370.
[23] NI J, ZHOU N, QIN J, et al. Motion sensitive contrastive learning for self-supervised video representation[C]//Proceedings of the European Conference on Computer Vision, Tel-Aviv, Oct 23-27, 2022. Cham: Springer, 2022: 457-474.
[24] YAO T, ZHANG Y, QIU Z, et al. Seco: exploring sequence supervision for unsupervised representation learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2-9, 2021. Menlo Park: AAAI Press, 2021: 10656-10664.
[25] SUN C, MYERS A, VONDRICK C, et al. Videobert: a joint model for video and language representation learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway, NJ: IEEE, 2019: 7464-7473.
[26] ZHANG D, ZHENG Z, LI M, et al. Reinforced similarity learning: siamese relation networks for robust object tracking[C]//Proceedings of the 28th ACM International Conference on Multimedia, Seattle, Oct 12-16, 2020. New York: ACM, 2020: 294-303.
[27] ZHANG D, ZHENG Z. Joint representation learning with deep quadruplet network for real-time visual tracking[C]//Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, Jul 19-24, 2020. Piscataway: IEEE, 2020: 1-8.
[28] SOOMRO K, ZAMIR A R, SHAH M. A dataset of 101 human action classes from videos in the wild[EB/OL].(2012-12-03)[2022-11-05]. https://arxiv.org/abs/1212.0402.
[29] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]//Proceedings of the IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. New York: IEEE Press, 2011: 2556-2563.
[30] XIE S, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning for video understanding[C]//Proceedings of the European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 305-321.
[31] ZACH C, POCK T, BISCHOF H. A duality based approach for realtime TV-L1 optical flow[C]//Pattern Recognition: 29th DAGM Symposium, Heidelberg, Sep 12-14, 2007. Berlin: Springer, 2007: 214-223.
[32] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, Jul 21-26, 2017. New York: IEEE Press, 2017: 6299-6308.
[33] XU D, XIAO J, ZHAO Z, et al. Self-supervised spatiotemporal learning via video clip order prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. New York: IEEE Press, 2019: 10334-10343.
[34] WANG J, JIAO J, BAO L, et al. Self-supervised video representation learning by uncovering spatio-temporal statistics[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3791-3806.
[35] WANG J, GAO Y, LI K, et al. Enhancing unsupervised video representation learning by decoupling the scene and the motion[C]//Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2-9, 2021. Menlo Park, CA: AAAI Press, 2021: 10129-10137.
[36] GUO S, XIONG Z, ZHONG Y, et al. Cross-architecture self-supervised video representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, Jun 19-21, 2022. New York: IEEE Press, 2022: 19270-19279.
[37] WANG J, GAO Y, LI K, et al. Removing the background by adding the background: towards background robust self-supervised video representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 19-25, 2021. New York: IEEE Press, 2021: 11804-11813.
[38] DUAN H, ZHAO N, CHEN K, et al. Transrank: self-supervised video representation learning via ranking-based transformation recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, Jun 19-21, 2022. New York: IEEE Press, 2022: 3000-3010.