Research on Human Behavior Recognition Based on Temporal and Spatial Information Fusion

doi:10.3778/j.issn.1002-8331.2109-0213

Abstract

Abstract: Human behavior recognition is an important topic in video understanding, but fusing the spatial and temporal information in video sequences is difficult and recognition accuracy is low. To address these problems, this paper proposes a two-stream spatio-temporal residual convolutional network model based on spatio-temporal information fusion. First, the video is divided into segments, and RGB images and optical flow images are sampled from each segment and fed into the two-stream spatio-temporal residual network. The designed spatio-temporal residual module then extracts deep spatio-temporal features of the video. Finally, the class results of the individual video segments are fused by weighting to obtain the behavior class. The proposed two-stream spatio-temporal residual module introduces a small number of three-dimensional convolutions and a mixed attention mechanism, so it can capture spatio-temporal information at different scales while suppressing invalid information. This effectively balances the capture of spatio-temporal information against the amount of computation and improves accuracy. The experiments build on the TSN network model and are validated on the UCF101 dataset. The results show that the proposed model improves accuracy by 0.9 percentage points over the original TSN network model and effectively improves the efficiency with which the network captures spatio-temporal information.

Key words: behavior recognition, two-stream network, residual structure, attention mechanism, temporal information
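
The sampling and fusion steps outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of the centre frame of each segment, the uniform averaging over segments, and the stream weights `w_rgb` and `w_flow` are all assumptions made for illustration, and the per-segment scores stand in for the outputs of the two network streams.

```python
from typing import List, Sequence


def segment_indices(num_frames: int, num_segments: int) -> List[int]:
    """Split a video into equal segments and pick the centre frame of each.

    Assumption: one frame per segment, taken at the segment centre.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


def average_scores(segment_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment class scores into one video-level score vector.

    Assumption: every segment carries the same weight.
    """
    n = len(segment_scores)
    return [sum(column) / n for column in zip(*segment_scores)]


def fuse_streams(rgb: Sequence[float], flow: Sequence[float],
                 w_rgb: float = 1.0, w_flow: float = 1.5) -> List[float]:
    """Weighted sum of the RGB-stream and optical-flow-stream scores.

    The default weights are placeholders, not values from the paper.
    """
    return [w_rgb * r + w_flow * f for r, f in zip(rgb, flow)]


def predict(rgb_segments: Sequence[Sequence[float]],
            flow_segments: Sequence[Sequence[float]],
            w_rgb: float = 1.0, w_flow: float = 1.5) -> int:
    """Return the index of the class with the highest fused score."""
    fused = fuse_streams(average_scores(rgb_segments),
                         average_scores(flow_segments), w_rgb, w_flow)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Hypothetical scores for 3 segments and 3 classes per stream.
    rgb = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]]
    flow = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1], [0.1, 0.8, 0.1]]
    print(segment_indices(30, 3))
    print(predict(rgb, flow))
```

In this sketch the two streams disagree on the top class, so the stream weights decide the final label. In a real system the weights would be chosen on validation data.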

YU Haigang, HE Ning, LIU Shengjie, HAN Wenjing. Research on Human Behavior Recognition Based on Temporal and Spatial Information Fusion[J]. Computer Engineering and Applications, 2023, 59(3): 202-208.
