[1] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014: 1725-1732.
[2] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Advances in Neural Information Processing Systems, 2014.
[3] WANG L M, QIAO Y, TANG X O. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4305-4314.
[4] DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 2625-2634.
[5] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1933-1941.
[6] WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2016: 20-36.
[7] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733.
[8] HARA K, KATAOKA H, SATOH Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6546-6555.
[9] XIE S N, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2018: 318-335.
[10] WANG X L, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7794-7803.
[11] FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 6201-6210.
[12] QIU Z F, YAO T, NGO C W, et al. MLP-3D: a MLP-like 3D architecture with grouped time mixing[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 3052-3062.
[13] ZHU Y, LAN Z Z, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]//Proceedings of the 14th Asian Conference on Computer Vision. Cham: Springer International Publishing, 2019: 363-378.
[14] LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092.
[15] FEICHTENHOFER C. X3D: expanding architectures for efficient video recognition[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 200-210.
[16] PIERGIOVANNI A J, ANGELOVA A, RYOO M S. Tiny video networks[J]. Applied AI Letters, 2022, 3(1): e38.
[17] LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13688-13698.
[18] SUN X H, YU A X, SHEN X L, et al. Abnormal behavior recognition based on hybrid attention mechanism[J]. Computer Engineering and Applications, 2023, 59(5): 140-147. (in Chinese)
[19] HUANG Z Y, ZHANG S W, PAN L, et al. Temporally-adaptive models for efficient video understanding[J]. arXiv:2308.05787, 2023.
[20] ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 6816-6826.
[21] NEIMARK D, BAR O, ZOHAR M, et al. Video transformer network[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE, 2021: 3156-3165.
[22] LIU Z, NING J, CAO Y, et al. Video Swin transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 3202-3211.
[23] TONG Z, SONG Y, WANG J, et al. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training[C]//Advances in Neural Information Processing Systems, 2022: 10078-10093.
[24] WANG R, CHEN D D, WU Z X, et al. Masked video distillation: rethinking masked feature modeling for self-supervised video representation learning[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 6312-6322.
[25] WANG L, HUANG B, ZHAO Z, et al. VideoMAE v2: scaling video masked autoencoders with dual masking[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 14549-14560.
[26] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[27] CHEN X J, LIANG Q E, HAN X Q, et al. Video behavior recognition network using multi time-scale convolution[J]. Journal of National University of Defense Technology, 2023, 45(3): 136-145. (in Chinese)
[28] PAN X, GE C, LU R, et al. On the integration of self-attention and convolution[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 815-825.
[29] HAN K, WANG Y H, TIAN Q, et al. GhostNet: more features from cheap operations[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1577-1586.
[30] WANG Z W, SHE Q, SMOLIC A. ACTION-Net: multipath excitation for action recognition[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 13209-13218.
[31] LI J F, WEN Y, HE L H. SCConv: spatial and channel reconstruction convolution for feature redundancy[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 6153-6162.
[32] YUN S, OH S J, HEO B, et al. VideoMix: rethinking data augmentation for video classification[J]. arXiv:2012.03457, 2020.