Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (9): 150-158.DOI: 10.3778/j.issn.1002-8331.2112-0579

• Pattern Recognition and Artificial Intelligence •

Spatial-Temporal Convolutional Attention Network for Action Recognition

LUO Huilan, CHEN Han   

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Online: 2023-05-01  Published: 2023-05-01


Abstract: In video action recognition, how fully the correlations between features are learned and exploited, in both the spatial and the temporal dimension, has a large impact on final recognition performance. Convolution obtains local features by computing correlations between feature points within a neighborhood, while the self-attention mechanism learns global information through interactions among all feature points. A single convolutional layer cannot learn feature correlations from a global perspective, and even stacking many layers only yields somewhat larger receptive fields. The self-attention layer, although global in scope, attends only to the content relationships expressed by different feature points and ignores local positional characteristics. To address these problems, a spatial-temporal convolutional attention network is proposed for action recognition. It is composed of a spatial convolutional attention network and a temporal convolutional attention network. The spatial convolutional attention network uses self-attention to capture appearance relationships in the spatial dimension and one-dimensional convolution to extract dynamic information. The temporal convolutional attention network obtains correlations between frame-level features in the temporal dimension through self-attention and uses 2D convolution to learn spatial features. The spatial-temporal convolutional attention network fuses the test results of the two networks to improve recognition performance. In experiments on the HMDB51 dataset, taking ResNet50 as the baseline, introducing the spatial-temporal convolutional attention module improves recognition accuracy by 6.25 and 5.13 percentage points on the spatial and temporal streams, respectively.
Compared with current state-of-the-art methods, the spatial-temporal convolutional attention network shows clear advantages on both the UCF101 and HMDB51 datasets. The proposed network effectively captures feature correlation information; by combining the global connectivity of self-attention with the local connectivity of convolution, it improves the spatial-temporal modeling ability of the neural network.
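The contrast the abstract draws, convolution mixing only a local neighborhood of feature points versus self-attention relating every position to every other, can be illustrated with a small sketch. This is generic scaled dot-product self-attention and valid 1-D convolution in plain Python, not the paper's actual module; the identity Q/K/V projections and the function names are simplifying assumptions for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(feats):
    """Scaled dot-product self-attention over a list of feature vectors.
    Each output position is a weighted sum of ALL positions (global view).
    Identity Q/K/V projections keep the sketch minimal (an assumption,
    not the paper's design)."""
    d = len(feats[0])
    out = []
    for q in feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in feats]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, feats))
                    for j in range(d)])
    return out

def conv1d(seq, kernel):
    """Valid 1-D convolution over a scalar sequence: each output depends
    only on a local window, the local connectivity contrasted above."""
    k = len(kernel)
    return [sum(kernel[i] * seq[t + i] for i in range(k))
            for t in range(len(seq) - k + 1)]
```

Each `self_attention` output is a convex combination of all input vectors, whereas each `conv1d` output sees only its `len(kernel)`-wide window; the proposed network's two streams pair one global operator with one local operator along complementary dimensions.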

Key words: action recognition, deep learning, feature fusion, self-attention mechanism, convolutional network
