Computer Engineering and Applications, 2023, Vol. 59, Issue (18): 137-144. DOI: 10.3778/j.issn.1002-8331.2206-0322

• Pattern Recognition and Artificial Intelligence •

Combining Adaptive Graph Convolution and Temporal Modeling for Skeleton-Based Action Recognition

ZHEN Haoyu, ZHANG De

  1. School of Electrical and Information Engineering & Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
  • Online: 2023-09-15 Published: 2023-09-15

Abstract: Graph convolutional networks are widely used for human action recognition from 3D skeleton data. Adaptive graph convolution can effectively learn and reflect the relative positional relationships within the data of different actions, and is used to extract spatial features. For temporal features, most methods stack multiple layers of one-dimensional local convolution to capture the relationships between adjacent time steps, and so ignore key temporal information between non-adjacent time steps. This paper therefore proposes an action recognition model that combines adaptive graph convolution with multi-scale temporal modeling. The adaptive graph convolution learns the graph topology for different convolution layers and data samples in an end-to-end manner, which makes graph modeling more flexible. The multi-scale temporal modeling builds temporal relationships between both adjacent and non-adjacent time steps, and thereby fully extracts the temporal dynamics of skeleton sequences. The results show that, compared with mainstream algorithms, the model achieves considerably higher accuracy on both the NTU RGB+D and NTU RGB+D 120 benchmark datasets.

Key words: human skeleton, action recognition, adaptive graph convolution, multi-scale temporal modeling
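
The abstract names two components but does not give their formulas, so the following is only a minimal illustrative sketch, not the authors' implementation. It assumes that the adaptive adjacency is the sum of a fixed skeleton graph, a per-layer learned matrix, and a per-sample term obtained from a softmax over joint-feature similarity, and that non-adjacent time steps are reached with dilated one-dimensional convolutions whose branches are averaged. All function and parameter names (`adaptive_adjacency`, `b_learned`, `dilations`, and so on) are hypothetical, and only the Python standard library is used.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def adaptive_adjacency(x, a_fixed, b_learned):
    """Build a V x V adjacency for one sample (assumed formulation).

    x         : V x C joint features of the sample
    a_fixed   : V x V adjacency of the physical skeleton
    b_learned : V x V matrix that would be trained per layer
    """
    x_t = [list(col) for col in zip(*x)]          # C x V transpose
    c_data = softmax_rows(matmul(x, x_t))         # data-dependent term
    v = len(a_fixed)
    return [[a_fixed[i][j] + b_learned[i][j] + c_data[i][j]
             for j in range(v)] for i in range(v)]


def graph_conv(x, adj):
    """Aggregate joint features over the graph: adj @ x."""
    return matmul(adj, x)


def dilated_temporal_conv(seq, kernel, dilation):
    """1D convolution over time with zero padding; output length equals input length.

    With dilation > 1 each output mixes time steps that are not adjacent.
    """
    t_len, half = len(seq), len(kernel) // 2
    out = []
    for t in range(t_len):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t + (i - half) * dilation
            if 0 <= idx < t_len:
                acc += w * seq[idx]
        out.append(acc)
    return out


def multi_scale_temporal(seq, kernel, dilations=(1, 2, 4)):
    """Average several dilated branches (averaging is an assumption)."""
    branches = [dilated_temporal_conv(seq, kernel, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]
```

In this sketch, `dilation=1` covers adjacent time steps and larger dilations cover non-adjacent ones. A real model would learn `b_learned` and the kernels by backpropagation, and might concatenate the branches rather than average them.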