X_YOWO: Real-Time Human Behavior Positioning Method

doi:10.3778/j.issn.1002-8331.2103-0280

Abstract

Video-based human action localization is widely used in urban security systems, human-computer interaction systems, and other fields. Existing human action localization models are complex and struggle to balance localization accuracy against detection speed. To address this, this paper proposes X_YOWO, a new framework for human action localization that inherits the two branches of YOWO, a 3D-CNN and a 2D-CNN, and redesigns the channel fusion and bounding-box regression strategies. First, a channel attention mechanism based on the correlation coefficient matrix, together with a correlation loss function, lets the model obtain more effective features when samples are scarce, improving its ability to learn features. Second, anchor initialization is selected according to distance probability, which makes the initial cluster centers more stable, so that the resulting anchor box sizes better match the variation in target sizes in the data set. Finally, the CIoU regression loss is used as the objective function to improve the stability of bounding-box regression. At a detection speed of 22 frames/s, different methods are compared on the public data sets UCF101-24 and J-HMDB-21: with X_YOWO, frame-mAP increases by 3 percentage points, and video-mAP at different thresholds also performs well. On a self-made data set, X_YOWO improves detection accuracy by 3.6 percentage points and localization accuracy by 4.94 percentage points, with better stability. The results verify that X_YOWO achieves higher detection accuracy, stronger generalization, and better stability while maintaining real-time performance.

Key words: X_YOWO, human action localization, anchor box, correlation coefficient, loss function
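
The abstract's second step, choosing initial anchor cluster centers according to distance probability, is not spelled out on this page. One plausible reading is k-means++-style seeding, in which each new center is drawn with probability proportional to its distance from the centers already chosen. The sketch below follows that reading and should not be taken as the authors' implementation; the function names `wh_iou` and `seed_anchors`, the 1 - IoU distance on (width, height) pairs, and the sample boxes are all assumptions made for illustration.

```python
import random


def wh_iou(box, center):
    """IoU of two (width, height) pairs aligned at a common corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def seed_anchors(boxes, k, rng):
    """Pick k initial centers, favoring boxes far from those already chosen.

    Assumption: "distance" is 1 - IoU between (width, height) pairs, as is
    common in YOLO-style anchor clustering; the paper may define it differently.
    """
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        # Distance from each box to its nearest already-chosen center.
        dists = [min(1.0 - wh_iou(b, c) for c in centers) for b in boxes]
        if sum(dists) == 0.0:
            # Every box coincides with a chosen center: fall back to uniform.
            centers.append(rng.choice(boxes))
            continue
        # Draw the next center with probability proportional to its distance.
        centers.append(rng.choices(boxes, weights=dists, k=1)[0])
    return centers


if __name__ == "__main__":
    # Hypothetical ground-truth box sizes (width, height) in pixels.
    sample_boxes = [(10, 20), (12, 22), (50, 60), (55, 58), (100, 40), (105, 42)]
    print(seed_anchors(sample_boxes, 3, random.Random(0)))
```

The seeds returned here would normally be passed to an ordinary k-means loop; seeding only changes where that loop starts, which is what makes the final anchor sizes less sensitive to an unlucky random start.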

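Abstract (Chinese version, translated): Video-based human action localization is in broad demand in urban security systems, human-computer interaction systems, and other fields. To address the complexity of existing human action localization models and the difficulty of balancing localization accuracy against detection speed, a new deep learning framework for human action localization, X_YOWO, is proposed. The framework inherits the 3D-CNN and 2D-CNN branches of the original YOWO and redesigns the channel fusion and boundary regression strategies: a channel attention mechanism based on the correlation coefficient matrix and a correlation loss function enable the model to obtain more effective features when samples are few, improving its ability to learn features; anchor clustering selection based on distance probability avoids the poor stability of the original cluster centers, so that the improved anchor box sizes better fit the variation in target sizes in the data set; and the CIoU regression loss function is used as the objective function to improve the stability of bounding-box regression. Different methods are compared on the public data sets UCF101-24 and J-HMDB-21: at a detection speed of 22 frames/s, X_YOWO raises frame-mAP by 3 percentage points, and video-mAP at different thresholds also performs well. On a self-made data set, at a detection speed of 22 frames/s, X_YOWO improves detection accuracy by 3.6 percentage points and localization accuracy by 4.94 percentage points, and is also more stable. The experimental results verify that X_YOWO offers higher detection accuracy, stability, and generalization ability while maintaining real-time performance.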

Key words (Chinese version, translated): X_YOWO, human action localization, anchor box, correlation coefficient, loss function
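
The CIoU regression loss named in the abstract is usually written as 1 - IoU + rho^2 / c^2 + alpha * v, where rho is the distance between the two box centers, c is the diagonal of the smallest box enclosing both, v measures the aspect-ratio mismatch, and alpha = v / ((1 - IoU) + v). The following is a minimal sketch of that standard formulation for a single pair of boxes, not the code used in the paper; the function name `ciou_loss`, the (x1, y1, x2, y2) corner format, and the `eps` guard are assumptions.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) corners.

    Assumes both boxes have positive width and height.
    """
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term: plain intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Center-distance term: squared center distance divided by the squared
    # diagonal of the smallest box enclosing both boxes.
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    target = (2.0, 2.0, 3.0, 3.0)
    print(ciou_loss((0.0, 0.0, 1.0, 1.0), target))  # far from the target
    print(ciou_loss((1.0, 1.0, 2.0, 2.0), target))  # closer, still no overlap
```

Unlike a plain IoU loss, the center-distance term keeps the loss changing as a non-overlapping box moves toward its target, which is the usual argument for why CIoU makes box regression more stable.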

YUAN Saimei, HUANG Yimeng, FENG Lihang, ZHU Wenjun, YI Yang. X_YOWO: Real-Time Human Behavior Positioning Method[J]. Computer Engineering and Applications, 2022, 58(20): 148-156.

袁赛美, 黄怡蒙, 冯李航, 朱文俊, 易阳. X-YOWO：实时人体行为定位方法[J]. 计算机工程与应用, 2022, 58(20): 148-156.
