Combining Adaptive Graph Convolution and Temporal Modeling for Skeleton-Based Action Recognition

doi:10.3778/j.issn.1002-8331.2206-0322

Abstract

Graph convolutional neural networks have been widely used in human action recognition based on 3D skeleton data. Adaptive graph convolution can effectively learn and reflect the relative positional relationships within different action data, and is used to extract spatial features. For temporal features, most methods capture the relationships between adjacent time steps by stacking multiple layers of one-dimensional local convolution, and so ignore key temporal information from non-adjacent time steps. This paper therefore proposes an action recognition model that combines adaptive graph convolution with multi-scale temporal modeling. The adaptive graph convolution learns the graph topology for different convolution layers and data samples in an end-to-end manner, which increases the flexibility of graph modeling. The multi-scale temporal modeling builds temporal relationships between both adjacent and non-adjacent time steps, and so more fully extracts the temporal dynamics of skeleton sequences. Experimental results show that, compared with mainstream algorithms, the model achieves considerably higher accuracy on the NTU RGB+D and NTU RGB+D 120 benchmark datasets.

Key words: human skeleton, action recognition, adaptive graph convolution, multi-scale temporal modeling
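
The abstract does not give the layer equations, so the following is only a minimal sketch of the two ideas it names, not the authors' model. It assumes a formulation commonly used for adaptive graph convolution, in which the adjacency is the sum of a fixed skeleton graph, a per-layer learned offset, and a per-sample similarity term, and it assumes that the multi-scale temporal branch can be approximated by averaging over dilated windows. All function names, the single-channel features, and the toy data are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_adjacency(x, a_fixed, b_learned):
    """Return A + B + C for one sample (assumed formulation).

    x:         T frames, each a list of V scalar joint features (toy: 1 channel).
    a_fixed:   V x V skeleton adjacency, hand-written here.
    b_learned: V x V per-layer offset; it would be trained, here it is given.
    C is data-dependent: row-softmax of joint-to-joint dot products over time.
    """
    v = len(a_fixed)
    sim = [[sum(f[i] * f[j] for f in x) for j in range(v)] for i in range(v)]
    c_data = [softmax(row) for row in sim]
    return [[a_fixed[i][j] + b_learned[i][j] + c_data[i][j] for j in range(v)]
            for i in range(v)]

def graph_conv(x, adj):
    """Spatial aggregation per frame: out[t][i] = sum_j adj[i][j] * x[t][j]."""
    return [[sum(adj[i][j] * f[j] for j in range(len(f)))
             for i in range(len(adj))] for f in x]

def multi_scale_temporal(x, dilations=(1, 2)):
    """Average each joint over frames t-d, t, t+d (zero-padded) for every
    dilation d, then average the branches. d=1 looks at adjacent frames,
    d>1 reaches non-adjacent ones."""
    t_count, v_count = len(x), len(x[0])
    out = [[0.0] * v_count for _ in range(t_count)]
    for d in dilations:
        for t in range(t_count):
            for v in range(v_count):
                window = [x[s][v] if 0 <= s < t_count else 0.0
                          for s in (t - d, t, t + d)]
                out[t][v] += sum(window) / 3.0 / len(dilations)
    return out

# Toy usage: 3 joints in a chain, 4 frames, one feature per joint.
chain = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]
seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
adj = adaptive_adjacency(seq, chain, zeros)
features = multi_scale_temporal(graph_conv(seq, adj))
```

In a trained network the offset `b_learned` and the embedding behind the similarity term would be parameters updated end to end, and the temporal branches would be learned convolutions rather than plain averages; the sketch only shows how per-sample topology and non-adjacent time steps enter the computation.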

ZHEN Haoyu, ZHANG De. Combining Adaptive Graph Convolution and Temporal Modeling for Skeleton-Based Action Recognition[J]. Computer Engineering and Applications, 2023, 59(18): 137-144.
