Hybrid Multi-Channel Associated Learning and Two-Branch Attention Fusion for Action Recognition

doi:10.3778/j.issn.1002-8331.2312-0036

Abstract

Abstract: Existing skeleton-based action recognition methods extract spatio-temporal features across different channels insufficiently and struggle to fully fuse features of different scales. To address both problems, an action recognition model based on hybrid multi-channel associated learning and two-branch attention fusion is proposed. A hybrid multi-channel graph topology is constructed to jointly learn the similarities and differences of joints across channels, which enables spatio-temporal feature extraction across channels. In addition, a two-branch attention fusion module with diversified receptive fields is proposed; it uses an attention mechanism to dynamically assign weights to local and global features, achieving context-aware fusion of information at different scales. Multiple comparative experiments are conducted on two public datasets, NTU-RGB+D 60 and NTU-RGB+D 120, on which the model reaches classification accuracies of 96.5% and 90.7%, respectively.

Key words: action recognition, hybrid multi-channel feature aggregation, attention fusion
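
The attention-based weighting of local and global features described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the internals of the fusion module, so everything below is an assumption made for illustration: the function name `attention_fuse`, the pooling-plus-bottleneck scoring, the reduction ratio `r`, and the random weights `w1` and `w2` are hypothetical and are not the authors' implementation. The sketch only shows the general idea of letting a learned, input-dependent weight decide how much each channel takes from a local branch versus a global branch.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_fuse(local_feat, global_feat, w1, w2):
    """Fuse two (C, T, V) feature maps with per-channel attention weights.

    Hypothetical illustration only; not the module from the paper.
    """
    # Summarise both branches into one descriptor per channel
    # by averaging over the temporal (T) and joint (V) axes.
    pooled = (local_feat + global_feat).mean(axis=(1, 2))   # shape (C,)
    # Assumed two-layer bottleneck that scores each channel.
    hidden = np.maximum(w1 @ pooled, 0.0)                   # ReLU, shape (C // r,)
    alpha = sigmoid(w2 @ hidden)[:, None, None]             # shape (C, 1, 1), values in (0, 1)
    # Convex combination: alpha weights the local branch,
    # (1 - alpha) weights the global branch.
    fused = alpha * local_feat + (1.0 - alpha) * global_feat
    return fused, alpha


rng = np.random.default_rng(0)
C, T, V, r = 8, 16, 25, 4          # V = 25 joints, as in NTU-RGB+D skeletons
local_feat = rng.standard_normal((C, T, V))
global_feat = rng.standard_normal((C, T, V))
w1 = rng.standard_normal((C // r, C)) * 0.1   # stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.1

fused, alpha = attention_fuse(local_feat, global_feat, w1, w2)
print(fused.shape, alpha.shape)
```

Because the two weights sum to one for every channel, each fused value lies between the corresponding local and global values, and the weights change with the input rather than being fixed. In a trained model, `w1` and `w2` would be learned parameters instead of random draws.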

LU Shaotong, WANG Chuanxu. Hybrid Multi-Channel Associated Learning and Two-Branch Attention Fusion for Action Recognition[J]. Computer Engineering and Applications, 2025, 61(8): 145-154.
