[1] WANG L,HUYNH D Q,KONIUSZ P.A comparative review of recent Kinect-based action recognition algorithms[J].IEEE Transactions on Image Processing,2020,29:15-28.
[2] ZHAO H,WILDES R P.Review of video predictive understanding:early action recognition and future action prediction[J].arXiv:2107.05140,2021.
[3] SUN Z,LIU J,KE Q,et al.Human action recognition from various data modalities:a review[J].arXiv:2012.11866,2020.
[4] SIMONYAN K,ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[C]//Advances in Neural Information Processing Systems,2014:568-576.
[5] WANG L,XIONG Y,WANG Z,et al.Temporal segment networks:towards good practices for deep action recognition[C]//European Conference on Computer Vision,2016:20-36.
[6] LAN Z Z,ZHU Y,HAUPTMANN A G.Deep local video feature for action recognition[C]//Computer Vision and Pattern Recognition Workshops(CVPRW),2017:1219-1225.
[7] SUN S,KUANG Z,SHENG L,et al.Optical flow guided feature:a fast and robust motion representation for video action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2018:1390-1399.
[8] PIERGIOVANNI A,RYOO M.Representation flow for action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2019:9945-9953.
[9] JI S,XU W,YANG M,et al.3D convolutional neural networks for human action recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(1):221-231.
[10] TRAN D,BOURDEV L,FERGUS R,et al.Learning spatiotemporal features with 3D convolutional networks[C]//IEEE International Conference on Computer Vision(ICCV),2015:4489-4497.
[11] LIU K,LIU W,GAN C,et al.T-C3D:temporal convolutional 3D network for real-time action recognition[C]//AAAI Conference on Artificial Intelligence,2018:7138-7145.
[12] CARREIRA J,ZISSERMAN A.Quo vadis,action recognition? a new model and the Kinetics dataset[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2017:4724-4733.
[13] TRAN D,WANG H,TORRESANI L,et al.A closer look at spatiotemporal convolutions for action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2018:6450-6459.
[14] QIU Z,YAO T,MEI T.Learning spatio-temporal representation with pseudo-3D residual networks[C]//IEEE International Conference on Computer Vision(ICCV),2017:5534-5542.
[15] ZHOU B,ANDONIAN A,OLIVA A,et al.Temporal relational reasoning in videos[C]//European Conference on Computer Vision,2018:831-846.
[16] LIN J,GAN C,HAN S.TSM:temporal shift module for efficient video understanding[C]//IEEE/CVF International Conference on Computer Vision(ICCV),2019:7082-7092.
[17] ZHOU Y,SUN X,LUO C,et al.Spatiotemporal fusion in 3D CNNs:a probabilistic view[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2020:9826-9835.
[18] TAO L,WANG X,YAMASAKI T.Rethinking motion representation:residual frames with 3D ConvNets for better action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2020:5667-5678.
[19] XU K,BA J,KIROS R,et al.Show,attend and tell:neural image caption generation with visual attention[J].arXiv:1502.03044,2015.
[20] HU J,SHEN L,SUN G.Squeeze-and-excitation networks[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2018:7132-7141.
[21] WOO S,PARK J,LEE J Y,et al.CBAM:convolutional block attention module[C]//European Conference on Computer Vision,2018:3-19.
[22] CARION N,MASSA F,SYNNAEVE G,et al.End-to-end object detection with transformers[C]//European Conference on Computer Vision,2020:213-229.
[23] GIRDHAR R,CARREIRA J,DOERSCH C,et al.Video action transformer network[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2019:244-253.
[24] LONG X,GAN C,DE MELO G,et al.Multimodal keyless attention fusion for video classification[C]//AAAI Conference on Artificial Intelligence,2018:7202-7209.
[25] BERTASIUS G,WANG H,TORRESANI L.Is space-time attention all you need for video understanding?[C]//International Conference on Machine Learning,2021:813-824.
[26] SOOMRO K,ZAMIR A R,SHAH M.UCF101:a dataset of 101 human actions classes from videos in the wild[J].arXiv:1212.0402,2012.
[27] KUEHNE H,JHUANG H,GARROTE E,et al.HMDB:a large video database for human motion recognition[C]//IEEE International Conference on Computer Vision(ICCV),2011:2556-2563.
[28] VAROL G,LAPTEV I,SCHMID C.Long-term temporal convolutions for action recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,40:1510-1517.
[29] LIU Z,YE T,WANG Z.Improving human action recognition by temporal attention[C]//IEEE International Conference on Image Processing(ICIP),2017:870-874.
[30] ZHOU Y,SUN X,ZHA Z J,et al.MiCT:mixed 3D/2D convolutional tube for human action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),2018:449-458.
[31] LUO H L,TONG K,YUAN P.Spatiotemporal squeeze-and-excitation residual multiplier network for video action recognition[J].Journal on Communications,2019,40(9):189-198.
[32] DIBA A,FAYYAZ M,SHARMA V,et al.Spatio-temporal channel correlation networks for action classification[C]//European Conference on Computer Vision,2018:299-315.
[33] TU Z,XIE W,DAUWELS J,et al.Semantic cues enhanced multimodality multistream CNN for action recognition[J].IEEE Transactions on Circuits and Systems for Video Technology,2019,29(5):1423-1437.
[34] JIANG M,PAN N,KONG J.Spatial-temporal saliency action mask attention network for action recognition[J].Journal of Visual Communication and Image Representation,2020,71:102846.
[35] MING Y,FENG F,LI C,et al.3D-TDC:a 3D temporal dilation convolution framework for video action recognition[J].Neurocomputing,2021,450:362-371.
[36] WANG Y,LIU W,XING W.Improved two-stream network for action recognition in complex scenes[C]//Artificial Intelligence and Electromechanical Automation,2021:361-365.
[37] KARPATHY A,TODERICI G,SHETTY S,et al.Large-scale video classification with convolutional neural networks[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition,2014:1725-1732.