Review of Human Behavior Recognition Based on RGB and Skeletal Data

doi:10.3778/j.issn.1002-8331.2407-0456

Abstract

Abstract: Human behavior recognition is an important research direction in computer vision, with wide application in human-computer interaction, medical rehabilitation, and autonomous driving. Because its methods are both important and at the research frontier, a comprehensive and systematic summary of the field is valuable. This paper examines in depth human behavior recognition methods based on the RGB and skeletal data modalities. According to how features are learned, these methods are divided into handcrafted feature extraction based on traditional machine learning and deep feature extraction based on deep learning. First, the basic pipeline of behavior recognition is introduced and the publicly available datasets are summarized. Then, the recognition methods for the RGB and skeletal data modalities are described in detail. For RGB data, feature extraction methods based on 2D convolutional neural networks (CNNs), recurrent neural networks (RNNs), and 3D CNNs are analyzed. For skeletal data, top-down and bottom-up pose estimation algorithms are presented, with a focus on classification algorithms based on RNNs, CNNs, graph convolutional networks (GCNs), Transformers, and hybrid neural networks. Finally, five future research directions for deep learning in human behavior recognition are outlined.

Key words: behavior recognition, computer vision, RGB data, skeletal data, feature extraction, deep learning

LI Tongwei, QIU Dawei, LIU Jing, LU Yinghang. Review of Human Behavior Recognition Based on RGB and Skeletal Data[J]. Computer Engineering and Applications, 2025, 61(8): 62-82.
