Crowd Counting Algorithm for Multi-Scale Fusion Based on Dual Branch Feature Extraction

doi:10.3778/j.issn.1002-8331.2305-0427

Abstract

Abstract: Crowd counting has important applications in public safety management, public space design, and related visual tasks such as behavior analysis and congestion analysis. However, complex backgrounds and large variations in head scale degrade counting performance. To address scale variation and background interference in static images, a crowd counting network based on dual-branch intermediate feature extraction, DBFE_MFNet, is proposed. The network follows an encoder-decoder structure and uses the first 16 layers of the VGG19 convolutional neural network in the encoding stage. To better fuse multi-scale information, the last 4 convolutions of these 16 layers are replaced with dilated convolutions with a dilation rate of 2. The decoder uses a residual convolutional attention module (RCAM) to suppress background interference, and a dual-branch intermediate feature extraction module (DBFE) is inserted between the encoder and the decoder. Branch 1 adopts a pyramid structure and integrates a position attention module to extract multi-scale contextual information; branch 2 also follows a pyramid structure and integrates a dual channel attention mechanism so that the model attends to heads of different sizes. Finally, a 1×1 convolution generates the density map. Comparison experiments are carried out on the ShanghaiTech PartA, ShanghaiTech PartB, and Mall datasets. On these datasets, the mean absolute error (MAE) of DBFE_MFNet is 63.2, 7.1, and 1.80, and the root mean square error (RMSE) is 99.2, 11.8, and 2.28, respectively. The comparative analysis indicates that the model achieves good counting performance and stability. Ablation experiments on ShanghaiTech PartB verify the effectiveness of each module.

Key words: crowd counting, VGG19, encoder-decoder, residual convolutional attention module (RCAM), dual-branch intermediate feature extraction module (DBFE)
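
The two error measures reported in the abstract can be illustrated with a minimal sketch. It assumes the usual crowd counting convention that each image yields one predicted count (typically the sum of the estimated density map) and one ground-truth count. The function names and the sample counts below are hypothetical and are not taken from the paper.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error between predicted and ground-truth counts."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical per-image counts, invented purely for illustration.
predicted_counts = [102.0, 48.0, 310.0, 75.0]
actual_counts = [100, 50, 300, 80]

print("MAE :", mae(predicted_counts, actual_counts))
print("RMSE:", rmse(predicted_counts, actual_counts))
```

Because RMSE squares each error before averaging, it penalizes large per-image errors more heavily than MAE. This is why MAE is commonly read as a measure of accuracy and RMSE as a measure of stability.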

ZENG Yunyun, ZHANG Hongying, YUAN Mingdong. Crowd Counting Algorithm for Multi-Scale Fusion Based on Dual Branch Feature Extraction[J]. Computer Engineering and Applications, 2024, 60(20): 224-232.

曾芸芸, 张红英, 袁明东. 多尺度融合的双分支特征提取人群计数算法[J]. 计算机工程与应用, 2024, 60(20): 224-232.
