Computer Engineering and Applications, 2025, Vol. 61, Issue (22): 226-234. DOI: 10.3778/j.issn.1002-8331.2408-0007

• Graphics and Image Processing •

Feature Refinement Skeletal Action Recognition Method Based on GCN and CNN Fusion

CHEN Xingqi, SONG Tao, ZOU Yangyang

  1. School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
  2. School of Electrical and Electronic Engineering, Chongqing University of Technology, Chongqing 400054, China
  • Online: 2025-11-15  Published: 2025-11-14

Abstract: Graph convolutional networks (GCNs) are widely used in current mainstream research on skeleton-based action recognition and achieve strong results, because they effectively preserve skeleton features. However, a fixed temporal convolution kernel size limits the receptive field of the temporal convolution, and three problems in graph convolution still need to be solved: insufficient extraction of skeleton feature information, cross-scale feature refinement, and multi-level semantic feature connection. To address these problems, a fusion network is designed that combines the ability of GCNs to preserve skeleton features with the strong spatial feature extraction ability of convolutional neural networks (CNNs). In this network, the multi-branch temporal enhanced convolution (MTE Conv) uses different branches together with temporal enhancement to obtain richer cross-scale refined features. The graph vertex enhanced module (GVEM) provides a multi-level semantic feature connection between the GCN and the CNN, so that graph skeleton features are better mapped into the CNN for spatio-temporal feature extraction. The method achieves accuracies of 97.63% and 91.16% on the X-view benchmark of NTU-RGB+D 60 and the X-set benchmark of NTU-RGB+D 120, respectively, indicating that the proposed method has superior performance.

Key words: action recognition, skeletal data, feature fusion, graph convolutional network (GCN)
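
As a rough illustration of the receptive-field limitation mentioned in the abstract, the sketch below computes how many frames a single 1D temporal convolution covers from its kernel size and dilation, and shows how parallel branches with different settings cover several time scales in the same layer. The branch settings and function names are assumptions made for illustration only. They are not taken from the paper, and the actual MTE Conv design may differ.

```python
def receptive_field(kernel_size: int, dilation: int = 1) -> int:
    """Frames covered by one 1D temporal convolution: (k - 1) * d + 1."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return (kernel_size - 1) * dilation + 1


def stacked_receptive_field(layers) -> int:
    """Receptive field of stride-1 convolutions applied in sequence.

    Each layer is a (kernel_size, dilation) pair; every layer adds
    (k - 1) * d frames to the span seen by one output frame.
    """
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf


# Hypothetical branch settings (NOT from the paper): (kernel_size, dilation).
BRANCHES = [(3, 1), (3, 2), (5, 1), (5, 2)]

# A single fixed kernel sees exactly one temporal scale.
fixed = receptive_field(9)

# Parallel branches see several temporal scales within the same layer.
multi = sorted({receptive_field(k, d) for k, d in BRANCHES})

if __name__ == "__main__":
    print("fixed kernel receptive field:", fixed)
    print("multi-branch receptive fields:", multi)
```

Under these assumed settings the branches jointly cover short, medium, and long temporal spans, whereas the single fixed kernel covers only one span. This is the general motivation for multi-branch temporal designs, not a statement of the authors' exact configuration.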