[1] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2014: 580-587.
[2] GIRSHICK R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1440-1448.
[3] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[4] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 779-788.
[5] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6517-6525.
[6] REDMON J, FARHADI A. YOLOv3: an incremental improvement[J]. arXiv:1804.02767, 2018.
[7] BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[J]. arXiv:2004.10934, 2020.
[8] LI C Y, LI L, JIANG H L, et al. YOLOv6: a single-stage object detection framework for industrial applications[J]. arXiv:2209.02976, 2022.
[9] WANG C Y, BOCHKOVSKIY A, LIAO H Y M. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 7464-7475.
[10] WANG C Y, YEH I H, LIAO H Y M. YOLOv9: learning what you want to learn using programmable gradient information[J]. arXiv:2402.13616, 2024.
[11] MAO J Y, XIAO T T, JIANG Y N, et al. What can help pedestrian detection?[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6034-6043.
[12] HUANG S Q, XU J, LIU Z G, et al. Image haze removal based on rolling deep learning and Retinex theory[J]. IET Image Processing, 2022, 16(2): 485-498.
[13] CAI Z W, FAN Q F, FERIS R S, et al. A unified multi-scale deep convolutional neural network for fast object detection[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 354-370.
[14] YANG F, CHOI W, LIN Y Q. Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 2129-2137.
[15] CHI C, ZHANG S F, XING J L, et al. PedHunter: occlusion robust pedestrian detector in crowded scenes[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020: 10639-10646.
[16] 王泽宇, 徐慧英, 朱信忠, 等. 基于YOLOv8改进的密集行人检测算法: MER-YOLO[J]. 计算机工程与科学, 2024, 46(6): 1050-1062.
WANG Z Y, XU H Y, ZHU X Z, et al. An improved dense pedestrian detection algorithm based on YOLOv8: MER-YOLO[J]. Computer Engineering & Science, 2024, 46(6): 1050-1062.
[17] 魏志, 刘罡, 张旭. 基于MobileNet的轻量化密集行人检测算法[J]. 软件工程, 2024, 27(6): 6-9.
WEI Z, LIU G, ZHANG X. Lightweight dense pedestrian detection algorithm based on MobileNet[J]. Software Engineering, 2024, 27(6): 6-9.
[18] 袁翔, 程塨, 李戈, 等. 遥感影像小目标检测研究进展[J]. 中国图象图形学报, 2023, 28(6): 1662-1684.
YUAN X, CHENG G, LI G, et al. Progress in small object detection for remote sensing images[J]. Journal of Image and Graphics, 2023, 28(6): 1662-1684.
[19] LIU C D, XU Y F, ZHONG J K. SLAM: a lightweight spatial location attention module for object detection[C]//Proceedings of the International Conference on Neural Information Processing. Singapore: Springer Nature Singapore, 2024: 373-387.
[20] YANG G Y, LEI J, ZHU Z K, et al. AFPN: asymptotic feature pyramid network for object detection[C]//Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. Piscataway: IEEE, 2023: 2184-2189.
[21] VASWANI A, SHAZEER N M, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2017: 6000-6010.
[22] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[23] MEHTA S, RASTEGARI M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer[J]. arXiv:2110.02178, 2021.
[24] MEHTA S, RASTEGARI M. Separable self-attention for mobile vision transformers[J]. arXiv:2206.02680, 2022.
[25] WADEKAR S, CHAURASIA A. MobileViTv3: mobile-friendly vision transformer with simple and effective fusion of local, global and input features[J]. arXiv:2209.15159, 2022.
[26] LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 936-944.
[27] ZHAO H Y, KONG X T, HE J W, et al. Efficient image super-resolution using pixel attention[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 56-72.
[28] TAN M X, PANG R M, LE Q V. EfficientDet: scalable and efficient object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10778-10787.
[29] HOU Q B, ZHOU D Q, FENG J S. Coordinate attention for efficient mobile network design[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 13708-13717.
[30] OUYANG D L, HE S, ZHANG G Z, et al. Efficient multi-scale attention module with cross-spatial learning[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5.
[31] ZHANG S F, XIE Y L, WAN J, et al. WiderPerson: a diverse dataset for dense pedestrian detection in the wild[J]. IEEE Transactions on Multimedia, 2020, 22(2): 380-393.
[32] SHAO S, ZHAO Z, LI B, et al. CrowdHuman: a benchmark for detecting human in a crowd[J]. arXiv:1805.00123, 2018.
[33] RUKHOVICH D, SOFIIUK K, GALEEV D, et al. IterDet: iterative scheme for object detection in crowded environments[C]//Proceedings of the Structural, Syntactic, and Statistical Pattern Recognition. Cham: Springer International Publishing, 2021: 344-354.
[34] GE Z, JIE Z Q, HUANG X, et al. PS-RCNN: detecting secondary human instances in a crowd via primary object suppression[C]//Proceedings of the IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2020: 1-6.
[35] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2999-3007.
[36] CAI Z W, VASCONCELOS N. Cascade R-CNN: delving into high quality object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6154-6162.
[37] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 3-19.
[38] WANG Q L, WU B G, ZHU P F, et al. ECA-Net: efficient channel attention for deep convolutional neural networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 11531-11539.