Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (9): 150-158.DOI: 10.3778/j.issn.1002-8331.2112-0579

• Pattern Recognition and Artificial Intelligence •

Spatial-Temporal Convolutional Attention Network for Action Recognition

LUO Huilan, CHEN Han   

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Online: 2023-05-01  Published: 2023-05-01


Abstract: In video action recognition, how fully the correlations between features are learned and exploited, in both the spatial and the temporal dimension, has a large impact on final recognition performance. Convolution obtains local features by computing correlations between feature points within a neighborhood, while the self-attention mechanism learns global information through interactions among all feature points. A single convolutional layer cannot learn feature correlations from a global perspective, and even stacking many layers only yields somewhat larger receptive fields. The self-attention layer, although global in scope, attends only to the content relationships expressed by different feature points and ignores local positional characteristics. To address these problems, a spatial-temporal convolutional attention network is proposed for action recognition. It is composed of a spatial convolutional attention network and a temporal convolutional attention network. The spatial convolutional attention network uses self-attention to capture appearance relationships in the spatial dimension and one-dimensional convolution to extract dynamic information. The temporal convolutional attention network obtains correlations between frame-level features in the temporal dimension through self-attention and uses 2D convolution to learn spatial features. The spatial-temporal convolutional attention network fuses the test results of the two networks to improve recognition performance. In experiments on the HMDB51 dataset, taking ResNet50 as the baseline, introducing the spatial-temporal convolutional attention module improves recognition accuracy by 6.25 and 5.13 percentage points on the spatial and temporal streams, respectively.
Compared with current state-of-the-art methods, the spatial-temporal convolutional attention network shows clear advantages on both the UCF101 and HMDB51 datasets. The proposed network effectively captures feature correlation information; by combining the global connectivity of self-attention with the local connectivity of convolution, it improves the spatial-temporal modeling ability of the neural network.
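The contrast the abstract draws, convolution mixing only a local neighborhood of feature points versus self-attention relating every position to every other, can be illustrated with a small sketch. This is generic scaled dot-product self-attention and valid 1-D convolution in plain Python, not the paper's actual module; the identity Q/K/V projections and the function names are simplifying assumptions for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(feats):
    """Scaled dot-product self-attention over a list of feature vectors.
    Each output position is a weighted sum of ALL positions (global view).
    Identity Q/K/V projections keep the sketch minimal (an assumption,
    not the paper's design)."""
    d = len(feats[0])
    out = []
    for q in feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in feats]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, feats))
                    for j in range(d)])
    return out

def conv1d(seq, kernel):
    """Valid 1-D convolution over a scalar sequence: each output depends
    only on a local window, the local connectivity contrasted above."""
    k = len(kernel)
    return [sum(kernel[i] * seq[t + i] for i in range(k))
            for t in range(len(seq) - k + 1)]
```

Each `self_attention` output is a convex combination of all input vectors, whereas each `conv1d` output sees only its `len(kernel)`-wide window; the proposed network's two streams pair one global operator with one local operator along complementary dimensions.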

Key words: action recognition, deep learning, feature fusion, self-attention mechanism, convolutional network
