Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (3): 127-134. DOI: 10.3778/j.issn.1002-8331.2108-0016

• Pattern Recognition and Artificial Intelligence •

Efficient 2D Temporal Modeling Network for Video Action Recognition

LI Zhilei, LI Jun, SHI Zhiping, JIANG Na, ZHANG Yongkang   

  1. School of Information Engineering, Capital Normal University, Beijing 100089, China
  • Online: 2023-02-01  Published: 2023-02-01

Abstract: 2D convolution struggles to model temporal information in video data effectively. To address this issue, an efficient 2D-convolution-based temporal modeling network is proposed. The network needs only RGB images as input, avoiding costly optical flow computation, and achieves state-of-the-art accuracy on action recognition tasks at low computational complexity. The network consists of two main parts: a motion feature enhancement module and a temporal aggregation module. Concretely, the motion feature enhancement module handles short-term temporal modeling: it uses the difference between the current frame and its adjacent frames to adaptively enhance the motion information in the current frame, letting the network learn which parts of the image are about to move. The temporal aggregation module performs long-term temporal modeling and is applied mainly in the later stages of the network; it aggregates information along the temporal dimension with 2D convolutions, so that the features extracted for each frame incorporate information from all frames in the sequence. Extensive experiments on three common video action recognition datasets (UCF101, HMDB51 and Something-Something V1) demonstrate that the proposed temporal modeling network achieves competitive recognition performance compared with most existing methods.
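
Since this abstract gives no implementation details, the PyTorch sketch below is only one plausible reading of the motion feature enhancement module: differences between consecutive frames are turned into a sigmoid gate that residually re-weights the current frame's features. The bottleneck design, the reduction ratio, and the zero-padding of the last time step are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionFeatureEnhancement(nn.Module):
    """Hypothetical short-term motion enhancement via frame differences."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        # Bottleneck that turns the frame difference into a gating map
        # (reduction ratio of 16 is an assumed value).
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W) frame-level feature maps.
        n, t, c, h, w = x.shape
        # The difference between consecutive frames approximates motion.
        diff = x[:, 1:] - x[:, :-1]                             # (N, T-1, C, H, W)
        diff = torch.cat([diff, torch.zeros_like(diff[:, -1:])], dim=1)
        gate = self.expand(torch.relu(self.reduce(diff.reshape(n * t, c, h, w))))
        gate = torch.sigmoid(gate).reshape(n, t, c, h, w)
        # Residual enhancement: emphasize regions that are about to move.
        return x + x * gate
```

For example, `MotionFeatureEnhancement(64)(torch.randn(2, 8, 64, 14, 14))` returns a tensor of the same shape with motion regions amplified.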
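
Likewise, a minimal sketch of how a 2D convolution could perform the long-term temporal aggregation described above: folding the T temporal samples of each channel into the channel axis lets a grouped 1×1 2D convolution mix every frame with every other frame. The grouped 1×1 kernel and the residual connection are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Hypothetical long-term temporal aggregation via grouped 1x1 2D conv."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.t = num_frames
        # One group per feature channel: each group mixes that channel's
        # T temporal samples, so every output frame combines all input frames.
        self.mix = nn.Conv2d(channels * num_frames, channels * num_frames,
                             kernel_size=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W) late-stage feature maps.
        n, t, c, h, w = x.shape
        # Reorder so each channel's temporal samples are adjacent: (N, C*T, H, W).
        y = x.permute(0, 2, 1, 3, 4).reshape(n, c * t, h, w)
        y = self.mix(y)                                         # mix across time
        y = y.reshape(n, c, t, h, w).permute(0, 2, 1, 3, 4)
        return x + y                                            # residual connection
```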

Key words: short-term motion feature enhancement, long-term temporal aggregation, temporal modeling, 2D convolutional network, action recognition