Review on Human Action Recognition Methods Based on Multimodal Data

doi:10.3778/j.issn.1002-8331.2310-0090

Abstract

Abstract: Human action recognition (HAR) is widely applied in intelligent security, autonomous driving, and human-computer interaction. With advances in capture equipment and sensor technology, the data available for HAR is no longer limited to RGB data but also includes multimodal data such as depth, skeleton, and infrared data. This review introduces in detail the feature extraction methods used in HAR for the RGB and skeleton data modalities, including methods based on handcrafted features and methods based on deep learning. For the RGB modality, feature extraction algorithms based on two-stream convolutional neural networks (2s-CNN), 3D convolutional neural networks (3DCNN), and hybrid networks are analyzed. For the skeleton modality, popular single-person and multi-person pose estimation algorithms are introduced first, followed by an in-depth analysis of classification algorithms based on convolutional neural networks (CNN), recurrent neural networks (RNN), and graph convolutional neural networks (GCN). The common datasets for both modalities are then presented comprehensively. In addition, current challenges are explored in light of the data structure characteristics of RGB and skeleton data. Finally, future research directions for deep learning-based HAR methods are discussed.

Key words: video understanding, human action recognition, deep learning, feature extraction, pose estimation algorithms

摘要： 人体行为识别广泛应用于智能安防、自动驾驶和人机交互等领域。随着拍摄设备和传感器技术的发展，可获取用于人体行为识别的数据不再局限于RGB数据，还有深度、骨骼和红外等多模态数据。详细介绍了基于RGB和骨骼数据模态的人体行为识别任务中特征提取方法，包括基于手工标注和基于深度学习的方法。对于RGB数据模态，重点分析了基于双流卷积神经网络、3D卷积神经网络和混合网络的特征提取算法。对于骨骼数据模态，介绍了目前流行的单人和多人姿态评估算法；重点分析了基于卷积神经网络、循环神经网络和图卷积神经网络的分类算法；进一步全面展示了两种数据模态的通用数据集。此外，基于RGB和骨骼各自的数据结构特征，探讨了目前面临的挑战，最后对未来基于深度学习的人体行为识别方法的研究方向进行了展望。

关键词: 视频理解, 人体行为识别, 深度学习, 特征提取, 姿态评估算法

WANG Cailing, YAN Jingjing, ZHANG Zhidong. Review on Human Action Recognition Methods Based on Multimodal Data[J]. Computer Engineering and Applications, 2024, 60(9): 1-18.

王彩玲, 闫晶晶, 张智栋. 基于多模态数据的人体行为识别方法研究综述[J]. 计算机工程与应用, 2024, 60(9): 1-18.
