Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (7): 134-142.DOI: 10.3778/j.issn.1002-8331.2111-0171

• Pattern Recognition and Artificial Intelligence •

Action Recognition Method Based on Multi-Level Feature Fusion and Temporal Extension

WU Haoyuan, XIONG Xin, MIN Weidong, ZHAO Haoyu, WANG Wenxiang   

  1. School of Information Engineering, Nanchang University, Nanchang 330031, China
    2. Information Department, First Affiliated Hospital of Nanchang University, Nanchang 330006, China
    3. Jiangxi Key Laboratory of Smart City, Nanchang 330047, China
    4. School of Software, Nanchang University, Nanchang 330047, China
  • Online: 2023-04-01    Published: 2023-04-01

Abstract: In recent years, action recognition based on graph convolutional networks (GCN) has become a research hotspot in the field of computer vision. However, existing GCN-based action recognition methods ignore motion features at the limb level, which makes the extraction of spatial behavior features inaccurate. In addition, these methods lack the ability to perform temporal dynamic modeling between interval frames, resulting in insufficient expression of temporal behavior features. To address these problems, an action recognition method based on a GCN with multi-level feature fusion and temporal extension is proposed. In this method, the multi-level fusion module extracts and fuses low-level joint features and high-level limb features to obtain more discriminative multi-level spatial features. At the same time, the temporal extension module learns rich multi-scale temporal features from adjacent frames and interval frames, which enhances the temporal expression of behavior features. Experimental results on three large datasets (NTU RGB+D 60, NTU RGB+D 120 and Kinetics-Skeleton) show that the recognition accuracy of the proposed method is higher than that of existing action recognition methods.
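As a rough illustration of the two modules described in the abstract, the following is a minimal PyTorch-style sketch, not the authors' implementation. It approximates limb-level features as differences between the joint features at the two ends of each bone and fuses them with the joint-level features, and it uses dilated temporal convolutions so that, in addition to adjacent frames, interval frames also contribute to the temporal features. All module names, the bone list and the hyperparameters are hypothetical.

    # A minimal, illustrative sketch only -- not the authors' implementation.
    # Assumes skeleton features shaped (batch, channels, frames, joints), as is
    # common in GCN-based action recognition pipelines (e.g. ST-GCN-style models).
    # Module names, the bone list and all hyperparameters are hypothetical.
    import torch
    import torch.nn as nn


    class MultiLevelFusion(nn.Module):
        """Fuse low-level joint features with high-level limb features."""

        def __init__(self, channels, bones):
            super().__init__()
            # bones: list of (parent_joint, child_joint) index pairs (hypothetical).
            self.register_buffer("bones", torch.tensor(bones, dtype=torch.long))
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, x):                      # x: (N, C, T, V)
            parent = x[..., self.bones[:, 0]]      # (N, C, T, num_bones)
            child = x[..., self.bones[:, 1]]
            limb = child - parent                  # limb-level (bone) features
            # Scatter limb features back onto the child joints so both levels align.
            limb_on_joints = torch.zeros_like(x)
            limb_on_joints[..., self.bones[:, 1]] = limb
            return self.fuse(torch.cat([x, limb_on_joints], dim=1))


    class TemporalExtension(nn.Module):
        """Multi-scale temporal modeling over adjacent and interval frames."""

        def __init__(self, channels, dilations=(1, 2, 4)):
            super().__init__()
            # Dilation 1 covers adjacent frames; larger dilations reach interval frames.
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, kernel_size=(3, 1),
                          padding=(d, 0), dilation=(d, 1))
                for d in dilations
            )

        def forward(self, x):                      # x: (N, C, T, V)
            return sum(branch(x) for branch in self.branches) / len(self.branches)


    if __name__ == "__main__":
        n, c, t, v = 2, 64, 32, 25                 # NTU RGB+D skeletons have 25 joints
        bones = [(0, 1), (1, 20), (20, 2), (2, 3)]  # a few illustrative bone pairs
        x = torch.randn(n, c, t, v)
        x = MultiLevelFusion(c, bones)(x)
        x = TemporalExtension(c)(x)
        print(x.shape)                             # torch.Size([2, 64, 32, 25])

In an actual model these blocks would sit inside each spatial-temporal GCN layer; the multi-branch dilated convolution is only one plausible way to realize the "adjacent plus interval frames" idea stated in the abstract.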

Key words: graph convolutional network (GCN), action recognition, multi-level feature fusion, temporal extension
