Efficient 2D Temporal Modeling Network for Video Action Recognition

doi:10.3778/j.issn.1002-8331.2108-0016

Abstract

Abstract: 2D convolution struggles to model temporal information in video data effectively. To address this issue, an efficient temporal modeling network based on 2D convolution is proposed. The network takes only RGB frames as input, which avoids costly optical flow computation, and achieves competitive accuracy on action recognition tasks at low computational cost. It consists of two main parts: a motion feature enhancement module and a temporal aggregation module. The motion feature enhancement module handles short-term temporal modeling: it uses the differences between the current frame and its adjacent frames to adaptively enhance the motion information in the current frame, so that the network learns which regions of the image are about to move. The temporal aggregation module handles long-term temporal modeling and is applied mainly in the later stages of the network. It aggregates information along the temporal dimension with 2D convolution, so that, after feature extraction, each frame's representation incorporates information from all frames in the sequence. Extensive experiments on three common video action recognition datasets (UCF101, HMDB51, and Something-Something V1) show that the proposed network achieves competitive recognition performance compared with most existing methods.

Keywords: short-term motion feature enhancement, long-term temporal aggregation, temporal modeling, 2D convolutional network, action recognition
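
The abstract does not specify the layer design of the two modules, so the NumPy sketch below is only an illustration of the two ideas it describes, not the authors' implementation. Everything in it is an assumption made for the example: the function names `motion_enhance`, `conv2d_same`, and `temporal_aggregate`, the `tanh` gate built from pooled frame differences, the residual connections, and the choice of a kernel that spans `2T - 1` frames so that every output frame sees the whole sequence.

```python
import numpy as np


def motion_enhance(feats):
    """Short-term sketch: gate each frame by its difference to adjacent frames.

    feats has shape (T, C, H, W). The gating scheme is hypothetical.
    """
    # Neighbouring frames; the first and last frame reuse themselves,
    # which yields a zero difference at the sequence boundaries.
    nxt = np.concatenate([feats[1:], feats[-1:]], axis=0)
    prv = np.concatenate([feats[:1], feats[:-1]], axis=0)
    diff = np.abs(nxt - feats) + np.abs(feats - prv)
    # Spatially pooled difference -> per-channel gate in [0, 1).
    gate = np.tanh(diff.mean(axis=(2, 3), keepdims=True))  # (T, C, 1, 1)
    # Residual form: a static clip (zero difference) passes through unchanged.
    return feats + feats * gate


def conv2d_same(x, kernel):
    """Naive zero-padded 'same' 2D cross-correlation for an odd-sized kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def temporal_aggregate(feats, kernel):
    """Long-term sketch: mix per-frame descriptors across time with a 2D conv.

    feats has shape (T, C, H, W); the result has shape (T, C).
    """
    pooled = feats.mean(axis=(2, 3))  # (T, C) map: time x channels
    return pooled + conv2d_same(pooled, kernel)


# Toy input: 8 frames of 4-channel 7x7 feature maps (sizes are arbitrary).
rng = np.random.default_rng(0)
T, C, H, W = 8, 4, 7, 7
feats = rng.standard_normal((T, C, H, W))

enhanced = motion_enhance(feats)
# A (2T - 1) x 1 averaging kernel covers all T frames from any position,
# so each frame's descriptor receives the temporal mean of its channel.
kernel = np.ones((2 * T - 1, 1)) / T
video_desc = temporal_aggregate(enhanced, kernel)
print(enhanced.shape, video_desc.shape)
```

In a trained network the gate and the kernel would be learned rather than fixed; the fixed averaging kernel is used here only to make the "each frame combines all frames" property easy to check by hand.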

LI Zhilei, LI Jun, SHI Zhiping, JIANG Na, ZHANG Yongkang. Efficient 2D Temporal Modeling Network for Video Action Recognition[J]. Computer Engineering and Applications, 2023, 59(3): 127-134.

