Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (20): 224-232. DOI: 10.3778/j.issn.1002-8331.2305-0427

• Graphics and Image Processing •

多尺度融合的双分支特征提取人群计数算法

曾芸芸,张红英,袁明东   

  1. 西南科技大学 信息工程学院,四川 绵阳 621010
  2. 西南科技大学 特殊环境机器人技术四川省重点实验室,四川 绵阳 621010
  • 出版日期:2024-10-15 发布日期:2024-10-15

Crowd Counting Algorithm for Multi-Scale Fusion Based on Dual Branch Feature Extraction

ZENG Yunyun, ZHANG Hongying, YUAN Mingdong   

  1. School of Information Engineering, Southwest University of Science and Technology, Mianyang, Sichuan 621010, China
  2. Robot Technology Used for Special Environment Key Laboratory of Sichuan Province, Southwest University of Science and Technology, Mianyang, Sichuan 621010, China
  • Online:2024-10-15 Published:2024-10-15

摘要: 人群计数在公共安全管理、公共空间设计以及其他视觉任务如行为分析、拥塞分析等方面具有重要的应用。然而复杂的背景和人头尺度大小不一导致人群计数的效果并不理想。针对静态图像中尺度变化和背景干扰问题,提出了一种基于双分支中间特征提取的人群计数网络——DBFE_MFNet。该网络沿用编码-解码器结构,在编码阶段使用VGG19卷积神经网络的前16层,为了更好融合多尺度信息,将VGG19卷积神经网络的前16层的后4层卷积替换成空洞率为2的膨胀卷积,解码部分采用抑制背景干扰的残差卷积注意力模块(residual convolutional attention module,RCAM),在编码-解码器结构中间插入双分支中间特征提取模块(dual branch intermediate feature extraction module,DBFE),分支1采用金字塔结构并融合位置注意力模块提取多尺度上下文信息,分支2沿用金字塔结构融合双通道注意力机制使模型关注不同大小人头信息,最后使用1×1卷积生成密度图。实验方面,在ShanghaiTech PartA、ShanghaiTech PartB、Mall数据集上进行了算法对比实验,DBFE_MFNet模型在上述数据集的平均绝对误差和均方根误差分别为63.2、7.1、1.80和99.2、11.8、2.28,经对比实验分析,DBFE_MFNet模型具有不错的计数性能和稳定性能;在ShanghaiTech PartB进行了消融实验,实验验证了模型各模块的有效性。

关键词: 人群计数, VGG19, 编码-解码器, 残差卷积注意力模块, 双分支中间特征提取模块

Abstract: Crowd counting has important applications in public safety management, public space design, and other visual tasks such as behavior analysis and congestion analysis. However, complex backgrounds and large variations in head scale lead to unsatisfactory counting performance. To address scale variation and background interference in static images, a crowd counting network based on dual-branch intermediate feature extraction, DBFE_MFNet, is proposed. The network follows an encoder-decoder structure. The encoder uses the first 16 layers of the VGG19 convolutional neural network; to better fuse multi-scale information, the last 4 convolutions of these 16 layers are replaced with dilated convolutions with a dilation rate of 2. The decoder uses a residual convolutional attention module (RCAM) to suppress background interference. A dual-branch intermediate feature extraction module (DBFE) is inserted between the encoder and the decoder: branch 1 adopts a pyramid structure combined with a position attention module to extract multi-scale contextual information, and branch 2 follows the same pyramid structure combined with a dual channel attention mechanism so that the model attends to heads of different sizes. Finally, a 1×1 convolution generates the density map. Comparison experiments are carried out on the ShanghaiTech PartA, ShanghaiTech PartB, and Mall datasets, where the model achieves mean absolute errors of 63.2, 7.1, and 1.80 and root mean square errors of 99.2, 11.8, and 2.28, respectively. The comparison indicates that the model offers good counting accuracy and stability. Ablation experiments on ShanghaiTech PartB verify the effectiveness of each module.
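The abstract states that the last 4 encoder convolutions use a dilation rate of 2 but gives no implementation details. The following is a minimal pure-Python sketch of what dilation does to a 3×3 kernel, not the authors' code: the function names `effective_kernel_size` and `dilated_conv2d`, the 7×7 toy input, and the all-ones kernel are assumptions chosen for illustration. It shows that a 3×3 kernel with dilation 2 covers a 5×5 window while keeping the same 9 weights.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in pixels) covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv2d(image, kernel, dilation=1):
    """Stride-1, unpadded 2D cross-correlation with a dilated square kernel.

    Illustrative helper only (hypothetical name, not from the paper).
    """
    k = len(kernel)
    span = effective_kernel_size(k, dilation)
    out_h = len(image) - span + 1
    out_w = len(image[0]) - span + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    total += kernel[a][b] * image[i + a * dilation][j + b * dilation]
            row.append(total)
        output.append(row)
    return output


# Toy 7x7 input whose pixel value is its row-major index (assumed data).
image = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
box = [[1.0] * 3 for _ in range(3)]  # all-ones 3x3 kernel (assumed)

dense = dilated_conv2d(image, box, dilation=1)    # 3x3 window -> 5x5 output
dilated = dilated_conv2d(image, box, dilation=2)  # 5x5 window -> 3x3 output

print(effective_kernel_size(3, 2))  # window side length for dilation 2
print(len(dense), len(dilated))     # output heights without padding
```

Without padding the dilated output shrinks faster than the dense one; networks of this kind typically pad the input so that spatial resolution is preserved, which is an assumption here since the abstract does not say how padding is handled.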

Key words: crowd counting, VGG19, encoder-decoder, residual convolutional attention module (RCAM), dual branch intermediate feature extraction module (DBFE)
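The abstract reports mean absolute error and root mean square error computed over per-image counts. As a hedged illustration of how these two metrics are commonly computed in density-map-based counting, the sketch below assumes that the predicted count of an image is the sum of its density map, which the abstract does not state explicitly. The function names and all numbers are toy values invented for the example, not results from the paper.

```python
import math


def predicted_count(density_map):
    """Estimated number of people: the sum over all density-map pixels (assumed convention)."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all test images."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_square_error(predicted, actual):
    """Square root of the average squared counting error over all test images."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Toy 2x2 density map (assumed values); its pixels sum to 2 people.
toy_density = [[0.5, 0.25], [0.25, 1.0]]
print(predicted_count(toy_density))

# Toy per-image counts (assumed values, not from the paper).
actual_counts = [10, 20, 30]
predicted_counts = [12, 19, 33]

print(mean_absolute_error(predicted_counts, actual_counts))
print(root_mean_square_error(predicted_counts, actual_counts))
```

Because the errors are squared before averaging, the root mean square error weights large per-image mistakes more heavily, which is why it is often read as an indicator of stability alongside the mean absolute error.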