Skeleton Action Recognition by Integrating Intrinsic Topology and Multi-Scale Time Features

doi:10.3778/j.issn.1002-8331.2309-0467

Abstract

Graph convolutional networks (GCNs) play a crucial role in skeleton-based human action recognition. Existing GCNs, however, ignore intrinsic relationships, have limited temporal convolution capability, and insufficiently explore the potential functional correlations between joints and bones. To address these problems, a skeleton action recognition method that integrates intrinsic topology and multi-scale temporal features is proposed. To infer the intrinsic topological relationships of the context, the model combines a multi-head self-attention mechanism with a shared topology to construct an intrinsic-topology spatial graph convolution module. Based on an analysis of complex action sequences, a multi-scale temporal convolution module is constructed to extend the temporal convolution structure and capture multi-scale temporal features. The model also builds a bridge for interaction between joint and bone information, so that the two are effectively transmitted and fused and the functional correlation between them can be explored further. On the NTU-RGB+D 60 dataset, the proposed method achieves recognition accuracies of 91.5% on the cross-subject (CS) benchmark and 96.9% on the cross-view (CV) benchmark; on the NTU-RGB+D 120 dataset, it achieves 89.0% on the cross-subject (C-Sub) benchmark and 90.8% on the cross-setup (C-Set) benchmark. The experimental results show that the proposed method extracts skeleton spatio-temporal features more effectively and improves recognition accuracy.
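To make two of the ideas in the abstract concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It illustrates (a) a spatial aggregation whose topology is a shared adjacency matrix plus a per-head self-attention map, and (b) a multi-scale temporal convolution built from dilated 1D branches. All names (`intrinsic_topology`, `spatial_graph_conv`, `multi_scale_temporal`), the additive combination of shared and attention topologies, the averaging over heads and dilations, and the toy chain skeleton are assumptions made for illustration; the abstract does not specify these details, and the joint-bone interaction is not sketched here.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def intrinsic_topology(x, shared_adj, w_q, w_k):
    """Return shared topology + one head's self-attention map (V x V).

    Assumption: the two topologies are simply added.
    """
    q, k = matmul(x, w_q), matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(w_q[0]))  # scaled dot-product attention
    attn = softmax_rows([[v / scale for v in row] for row in matmul(q, k_t)])
    return [[a + s for a, s in zip(attn_row, adj_row)]
            for attn_row, adj_row in zip(attn, shared_adj)]


def spatial_graph_conv(x, shared_adj, heads):
    """Aggregate joint features with each head's topology, then average heads."""
    outputs = [matmul(intrinsic_topology(x, shared_adj, w_q, w_k), x)
               for w_q, w_k in heads]
    n = len(outputs)
    return [[sum(o[i][j] for o in outputs) / n for j in range(len(x[0]))]
            for i in range(len(x))]


def multi_scale_temporal(seq, kernel, dilations):
    """Average of same-length dilated 1D convolutions over a time series."""
    t, half = len(seq), len(kernel) // 2
    out = [0.0] * t
    for d in dilations:
        for i in range(t):
            for j, w in enumerate(kernel):
                idx = i + (j - half) * d
                if 0 <= idx < t:  # zero padding outside the sequence
                    out[i] += w * seq[idx] / len(dilations)
    return out


def rand_matrix(rows, cols):
    """Random stand-in for learned projection weights."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
V, C, D = 5, 4, 3  # joints, feature channels, attention dimension (toy sizes)
x = rand_matrix(V, C)
# Hypothetical 5-joint chain skeleton with self-loops as the shared topology.
shared_adj = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(V)] for i in range(V)]
heads = [(rand_matrix(C, D), rand_matrix(C, D)) for _ in range(2)]
y = spatial_graph_conv(x, shared_adj, heads)  # V x C aggregated features

# A unit impulse at frame 2, filtered with dilations 1 and 2.
smoothed = multi_scale_temporal([0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1, 2])
# smoothed == [0.5, 0.5, 1.0, 0.5, 0.5, 0.0]
```

In the impulse example, the dilation-1 branch spreads the response to the neighbouring frames and the dilation-2 branch reaches two frames away, which is the sense in which dilated branches capture motion at several temporal scales. A practical model would use learned weights, batched tensors, and a deep-learning framework rather than nested lists.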

Keywords: skeleton action recognition, graph convolution, intrinsic topology, multi-scale, information fusion

WANG Qi, HE Ning. Skeleton Action Recognition by Integrating Intrinsic Topology and Multi-Scale Time Features[J]. Computer Engineering and Applications, 2025, 61(4): 150-157.

王琪, 何宁. 融合内在拓扑与多尺度时间特征的骨架动作识别[J]. 计算机工程与应用, 2025, 61(4): 150-157.
