Refined Multi-Scale Feature and Parallel Attention Based Crowd Detection

doi:10.3778/j.issn.1002-8331.2409-0077

Abstract

Abstract: Crowd detection is widely used in autonomous driving, traffic management, and intelligent security. The task is characterized by high crowd density, heavy pedestrian occlusion, large scale variation, and irregular crowd distribution, which makes it one of the challenging problems in computer vision. To further exploit the rich multi-scale information of crowds in dense scenes and to cope with irregular crowd distribution and shape, this paper proposes RMF R-CNN (refined multiscale feature R-CNN), a crowd detection algorithm built on Sparse R-CNN that uses refined multi-scale features and parallel attention. First, a receptive field fusion module is built from several parallel dilated convolutions of different scales to extract refined scale information. Second, a parallel attention module is built from dilated convolution attention and deformable convolution attention to perceive crowd distribution and shape information at different scales. Finally, to reduce the sensitivity of the loss to mislabeled data and to pedestrian scale, a dynamic loss weight is added to the original loss function so that the loss varies with pedestrian scale and prediction accuracy, which improves the model's generalization ability. Experimental results show that the proposed algorithm achieves an AP of 91.1%, an MR⁻² of 44.5%, and a Recall of 96.7% on datasets such as CrowdHuman and CityPersons, indicating that it improves crowd detection performance in dense scenes to a certain extent.

Key words: crowd detection, refined multi-scale feature, attention mechanism, Sparse R-CNN, dynamic loss weight
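
The receptive field fusion idea in the abstract, several dilated convolutions run in parallel and then combined, can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: it works on a 1D signal for readability, the function names are hypothetical, and fusing the branches by element-wise averaging is an assumption (the abstract does not say how the branches are fused). The one fact it relies on is that a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (odd kernel sizes only)."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    pad = (effective_kernel_size(k, dilation) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    # Taps are spaced `dilation` positions apart in the padded input.
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def fuse_parallel_branches(signal, kernel, dilations=(1, 2, 3)):
    """Run one branch per dilation rate and average the outputs element-wise.

    Averaging is an assumed fusion rule chosen for illustration only.
    """
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) / len(branches) for values in zip(*branches)]


# A unit impulse shows which positions each branch can "see".
impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
box_kernel = [1, 1, 1]
fused = fuse_parallel_branches(impulse, box_kernel)
```

With the impulse at index 4, the dilation-1 branch responds at indices 3 to 5, the dilation-2 branch at 2, 4, and 6, and the dilation-3 branch at 1, 4, and 7, so the fused response covers indices 1 to 7. A 2D detector would use 2D convolutions with learned weights, but the way the receptive field widens with dilation is the same.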

ZHANG Xin, KANG Shining, YANG Yuqi, WANG Jun, MA Zhiyuan. Refined Multi-Scale Feature and Parallel Attention Based Crowd Detection[J]. Computer Engineering and Applications, 2025, 61(23): 161-172.
