Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (20): 148-156. DOI: 10.3778/j.issn.1002-8331.2103-0280

• Pattern Recognition and Artificial Intelligence •

X-YOWO: A Real-Time Human Action Localization Method

袁赛美 (YUAN Saimei), 黄怡蒙 (HUANG Yimeng), 冯李航 (FENG Lihang), 朱文俊 (ZHU Wenjun), 易阳 (YI Yang)

  1. College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing 211816
  • Publication date: 2022-10-15  Release date: 2022-10-15

X_YOWO: Real-Time Human Behavior Positioning Method

YUAN Saimei, HUANG Yimeng, FENG Lihang, ZHU Wenjun, YI Yang   

  1. College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing 211816, China
  • Online: 2022-10-15  Published: 2022-10-15

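Abstract: Video-based human action localization is in broad demand in areas such as urban security systems and human-computer interaction systems. To address the complexity of existing human action localization models and the difficulty of balancing localization accuracy against detection speed, a new deep learning framework for human action localization, X_YOWO, is proposed. The framework inherits the 3D-CNN and 2D-CNN branches of the original YOWO and redesigns the channel fusion and bounding-box regression strategies. A channel attention mechanism based on the correlation coefficient matrix, together with a correlation loss function, allows the model to extract more effective features when samples are scarce and improves its ability to learn features. Anchors for clustering are selected according to a distance-based probability, which avoids the poor stability of the original cluster centers and makes the resulting anchor box sizes better suited to the variation in target size across the dataset. The CIoU regression loss is used as the objective function to improve the stability of bounding-box regression. Different methods are compared on the public datasets UCF101-24 and J-HMDB-21: at a detection speed of 22 frame/s, X_YOWO raises the frame-mAP by 3 percentage points, and the video-mAP at different thresholds also performs well. On a self-built dataset, at a detection speed of 22 frame/s, X_YOWO improves detection accuracy by 3.6 percentage points and localization accuracy by 4.94 percentage points, and it is also more stable. The experimental results confirm that X_YOWO achieves higher detection accuracy, stability, and generalization ability while remaining real-time.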

Keywords: X_YOWO, human action localization, anchor box, correlation coefficient, loss function

Abstract: Video-based human action localization is widely used in urban security systems, human-computer interaction systems, and other fields. Because existing human action localization models are complex and struggle to balance localization accuracy against detection speed, this paper proposes X_YOWO, a new framework for human action localization that inherits the two branches of YOWO, the 3D-CNN and the 2D-CNN, and redesigns the channel fusion and bounding-box regression strategies. First, a channel attention mechanism based on the correlation coefficient matrix, combined with a correlation loss function, lets the model obtain more effective features when samples are few, which improves its ability to learn features. Second, anchor initialization based on a distance probability is adopted to avoid the poor stability of the original cluster centers, so that the resulting anchor box sizes better fit the variation in target size in the dataset. Finally, the CIoU regression loss is used as the objective function to improve the stability of bounding-box regression. The performance of different methods is compared on the public datasets UCF101-24 and J-HMDB-21 at a detection speed of 22 frame/s: with X_YOWO, the frame-mAP increases by 3 percentage points, and the video-mAP at different thresholds also performs well. On the self-made dataset, also at 22 frame/s, X_YOWO improves detection accuracy by 3.6 percentage points and localization accuracy by 4.94 percentage points, and it is more stable. These results verify that X_YOWO offers higher detection accuracy, stronger generalization, and better stability while maintaining real-time performance.
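The abstract does not spell out how anchors are chosen by distance probability. One plausible reading is a k-means++-style seeding, in which each new cluster center is drawn with probability proportional to its squared distance from the centers already chosen. The sketch below illustrates that idea under stated assumptions: the distance is taken to be 1 - IoU between (width, height) pairs, the cluster update uses the mean, and the function names `wh_iou`, `seed_anchors`, and `cluster_anchors` are hypothetical, not taken from the paper.

```python
import numpy as np


def wh_iou(boxes, centers):
    """IoU between (w, h) pairs, assuming all boxes share one corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centers[:, 0] * centers[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)


def seed_anchors(boxes, k, rng):
    """k-means++-style seeding (assumed): far-away boxes are more likely picks."""
    centers = [boxes[rng.integers(len(boxes))]]
    while len(centers) < k:
        # Distance of every box to its nearest already-chosen center.
        d = 1.0 - wh_iou(boxes, np.array(centers)).max(axis=1)
        total = np.sum(d ** 2)
        if total == 0:
            # Every box coincides with a chosen center: fall back to uniform.
            prob = np.full(len(boxes), 1.0 / len(boxes))
        else:
            prob = d ** 2 / total
        centers.append(boxes[rng.choice(len(boxes), p=prob)])
    return np.array(centers)


def cluster_anchors(boxes, k, iters=50, seed=0):
    """Lloyd iterations on (w, h) with IoU assignment, started from the seeds."""
    rng = np.random.default_rng(seed)
    centers = seed_anchors(boxes, k, rng)
    for _ in range(iters):
        assign = wh_iou(boxes, centers).argmax(axis=1)
        new = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    # Return anchors sorted by area, smallest first.
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]


boxes = np.array([[10, 12], [11, 13], [50, 60], [52, 58], [200, 100], [190, 110]],
                 dtype=float)
anchors = cluster_anchors(boxes, k=3)
```

Seeding this way tends to spread the initial centers across the range of box sizes, which is one common remedy for the run-to-run instability of uniformly random initialization.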

Key words: X_YOWO, human action localization, anchor box, correlation coefficient, loss function
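For readers unfamiliar with the CIoU regression loss named in the abstract, the following is a minimal sketch of its standard published form for a single pair of boxes: 1 - IoU, plus the squared center distance normalized by the squared diagonal of the smallest enclosing box, plus an aspect-ratio consistency term. The corner format (x1, y1, x2, y2), the `eps` stabilizer, and the function name `ciou_loss` are assumptions made for illustration; the paper's own implementation may differ.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for one predicted box and one ground-truth box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
apart = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))
```

Unlike plain IoU, the center-distance term still gives a useful gradient when the boxes do not overlap, which is the usual argument for its more stable regression behavior.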