Small-Scale Hand Detection Method in Complex Backgrounds Based on Parallel Mixed Attention Mechanism

doi:10.3778/j.issn.1002-8331.2307-0302

Abstract

Abstract: Hand features in complex backgrounds are often indistinct and vary widely in scale, which makes high-precision detection difficult and leads to false and missed detections. To address this, this paper proposes a small-scale hand detection method built on YOLOv5. First, a parallel mixed attention mechanism (PMAM) module is embedded in the backbone network to strengthen the extraction of hand features. Second, a feature fusion network, the path bidirectional-feature pyramid network (PB-FPN), is designed by combining the path aggregation network (PANet) with the weighted bidirectional feature pyramid network (BiFPN); it introduces new paths into bottom-level feature fusion to improve the detection of small-scale hands. Third, the spatial pyramid pooling-fast (SPPF) module from the backbone network is brought into the feature fusion network and connected to the prediction heads of the model to further improve performance. In addition, FReLU is used as the activation function of the network to increase its spatial sensitivity and robustness. To validate the proposed method, a new dataset better suited to the research setting, TV-COCO-Hand, is constructed and used for the experiments. The improved model reaches an mAP of 91.4% on this dataset, 3.8 percentage points higher than the baseline network model, and it outperforms current mainstream detection models. Dataset comparison experiments on public datasets and detection experiments in real-world scenes further verify the generalization of the model.

Key words: computer vision, hand detection, parallel mixed attention mechanism, FReLU, feature fusion
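
The abstract names BiFPN-style weighted fusion and the FReLU activation without giving their formulas. As a rough illustration only, the sketch below shows the generic forms in which these two ingredients are usually described: fast normalized fusion, where non-negative weights are divided by their sum, and the funnel activation max(x, T(x)), where T is a small spatial window around each position. It is not the paper's PB-FPN or its trained network. The function names `fast_normalized_fusion` and `frelu`, the single-channel list-of-lists layout, the odd square kernel, the omission of batch normalization, and all numeric values are assumptions made for this example.

```python
from typing import List, Sequence

Grid = List[List[float]]  # one feature-map channel as rows of numbers (assumed layout)


def fast_normalized_fusion(features: Sequence[Grid],
                           weights: Sequence[float],
                           eps: float = 1e-4) -> Grid:
    """Weighted sum of same-shaped maps; weights are clamped at zero and normalized."""
    if not features or len(features) != len(weights):
        raise ValueError("need at least one feature map and one weight per map")
    # ReLU on the weights keeps them non-negative, so the normalization is stable.
    w = [max(0.0, wi) for wi in weights]
    total = eps + sum(w)
    rows, cols = len(features[0]), len(features[0][0])
    return [[sum(wi * f[r][c] for wi, f in zip(w, features)) / total
             for c in range(cols)]
            for r in range(rows)]


def frelu(x: Grid, kernel: Grid, bias: float = 0.0) -> Grid:
    """Funnel activation on one channel: max(x, T(x)).

    T(x) is a k x k weighted window with zero padding (k odd). The batch
    normalization that normally follows the window is left out in this sketch.
    """
    k = len(kernel)
    pad = k // 2
    rows, cols = len(x), len(x[0])
    out: Grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            t = bias
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        t += kernel[i][j] * x[rr][cc]
            # Unlike ReLU, the threshold depends on the pixel's neighborhood.
            row.append(max(x[r][c], t))
        out.append(row)
    return out


if __name__ == "__main__":
    shallow = [[1.0, 2.0]]   # hypothetical fine-scale feature map
    deep = [[3.0, 4.0]]      # hypothetical coarse-scale map, already resized
    print(fast_normalized_fusion([shallow, deep], [1.0, 1.0]))

    zero_kernel = [[0.0] * 3 for _ in range(3)]
    print(frelu([[-1.0, 2.0], [3.0, -4.0]], zero_kernel))
```

With an all-zero kernel and zero bias, T(x) is 0 everywhere and `frelu` reduces to ordinary ReLU; a non-zero kernel lets the threshold follow the local context, which is the sense in which the activation is spatially sensitive.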

LIANG Chao (梁超), WANG Yangping (王阳萍), WANG Wenrun (王文润). Small-Scale Hand Detection Method in Complex Backgrounds Based on Parallel Mixed Attention Mechanism[J]. Computer Engineering and Applications (计算机工程与应用), 2024, 60(22): 209-218.
