Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (3): 202-208. DOI: 10.3778/j.issn.1002-8331.2109-0213

• Graphics and Image Processing •

Research on Human Behavior Recognition Based on Temporal and Spatial Information Fusion

YU Haigang, HE Ning, LIU Shengjie, HAN Wenjing   

  1. Beijing Key Laboratory of Information Service Engineering, College of Smart City, Beijing Union University, Beijing 100101, China
  • Online: 2023-02-01 Published: 2023-02-01

基于时空信息融合的人体行为识别研究

于海港,何宁,刘圣杰,韩文静   

  1. 北京联合大学 北京市信息服务工程重点实验室,北京 100101

Abstract: Human behavior recognition is an important research topic in video understanding, but fusing the temporal and spatial information in a video sequence is difficult and recognition accuracy is low. To address these problems, this paper proposes a two-stream spatio-temporal residual convolutional network model based on spatio-temporal information fusion. First, each video is divided into segments, and RGB images and optical flow images are sampled from them and fed into the two-stream spatio-temporal residual network. The deep spatio-temporal features of the video are then extracted by the designed spatio-temporal residual module. Finally, the class scores of the individual video segments are weighted and fused to obtain the behavior category. The proposed two-stream spatio-temporal residual module introduces a small number of three-dimensional convolutions and a mixed attention mechanism, so it can capture spatio-temporal information at different scales while suppressing invalid information. This effectively balances spatio-temporal information capture against computational cost and improves accuracy. The experiments build on the TSN network model and are validated on the UCF101 dataset. The results show that the proposed model improves accuracy by 0.9 percentage points over the original TSN network model and captures spatio-temporal information more efficiently.

Key words: behavior recognition, two-stream network, residual structure, attention mechanism, temporal information
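
The segment sampling and weighted score fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of the centre frame of each segment, the uniform default segment weights, and the 1:1.5 RGB-to-flow stream weights are all assumptions made for illustration, and the per-segment class scores are taken as given rather than produced by the residual network.

```python
import math
from typing import List, Optional, Sequence


def segment_indices(num_frames: int, num_segments: int) -> List[int]:
    """Split a video into equal segments and pick the centre frame of each.

    Centre-frame sampling is an assumption; random sampling inside each
    segment is another common choice.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_consensus(
    segment_scores: Sequence[Sequence[float]],
    segment_weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-segment class scores within one stream."""
    n = len(segment_scores)
    weights = list(segment_weights) if segment_weights else [1.0] * n
    if len(weights) != n:
        raise ValueError("one weight per segment is required")
    norm = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, column)) / norm
        for column in zip(*segment_scores)
    ]


def fuse_streams(
    rgb_scores: Sequence[Sequence[float]],
    flow_scores: Sequence[Sequence[float]],
    rgb_weight: float = 1.0,   # assumed value, not taken from the paper
    flow_weight: float = 1.5,  # assumed value, not taken from the paper
) -> List[float]:
    """Combine the RGB and optical-flow streams by a weighted sum."""
    rgb = softmax(weighted_consensus(rgb_scores))
    flow = softmax(weighted_consensus(flow_scores))
    return [rgb_weight * r + flow_weight * f for r, f in zip(rgb, flow)]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the highest fused class score."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Toy example: 3 segments, 3 classes, made-up scores for each stream.
rgb_scores = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
flow_scores = [[0.0, 3.0, 0.0], [0.0, 2.0, 1.0], [0.0, 4.0, 0.0]]
fused = fuse_streams(rgb_scores, flow_scores)
label = predict(fused)
```

In this toy input the RGB stream alone favours class 0, while the more heavily weighted optical-flow stream favours class 1, so the fused prediction follows the flow stream. Passing non-uniform `segment_weights` to `weighted_consensus` would let individual segments contribute unequally.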
