Application of Improved Transformer Based on Weakly Supervised in Crowd Localization

doi:10.3778/j.issn.1002-8331.2206-0199

Abstract: Existing crowd localization methods rely on pseudo bounding boxes or pre-designed localization maps, and therefore need complex preprocessing and post-processing to obtain head positions. To address this, LocalFormer, a weakly supervised end-to-end crowd localization network, is proposed. In the feature extraction stage, a pure Transformer serves as the backbone network, and global max pooling is applied to the features of each stage to extract richer head details. In the encoder-decoder stage, the aggregated features are embedded with positional information and fed to the encoder; each decoder layer uses a set of trainable embeddings as queries and takes the visual features from the last encoder layer as keys and values. The decoded features are then used to predict confidence scores. Finally, a binarization module with an adaptively optimized threshold learner binarizes the confidence maps precisely. Experiments on three datasets covering different environments show that the proposed method achieves the best localization performance.
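Two of the operations named in the abstract, global max pooling over a stage's features and binarization of confidence scores, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy values, and the fixed threshold of 0.5 are assumptions made for illustration, whereas the abstract describes a threshold that is learned adaptively.

```python
from typing import List, Sequence

# A feature map laid out as channels x height x width (illustrative layout).
FeatureMap = List[List[List[float]]]


def global_max_pool(feature_map: FeatureMap) -> List[float]:
    """Reduce a C x H x W feature map to one maximum value per channel."""
    return [max(max(row) for row in channel) for channel in feature_map]


def binarize(scores: Sequence[float], threshold: float) -> List[int]:
    """Map each confidence score to 1 (head) or 0 (background)."""
    return [1 if s >= threshold else 0 for s in scores]


# Toy stage feature: 2 channels on a 2 x 2 spatial grid (hypothetical values).
stage_feature = [
    [[0.1, 0.7], [0.3, 0.2]],
    [[0.9, 0.4], [0.5, 0.6]],
]
pooled = global_max_pool(stage_feature)

# Toy per-query confidence scores (hypothetical values).
# The threshold is fixed here; in the paper it is produced by a learner.
confidences = [0.92, 0.15, 0.61, 0.48]
mask = binarize(confidences, threshold=0.5)
count = sum(mask)  # number of predicted heads
```

In this toy case, two of the four queries exceed the threshold, so the predicted count is 2. A learned threshold would replace the constant passed to `binarize`.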

Key words: crowd localization, weakly supervised, convolutional neural network (CNN), global max pooling (GMP), vision Transformer (ViT)

GAO Hui, DENG Miaolei, ZHAO Wenjun, CHEN Faquan, ZHANG Dexian. Application of Improved Transformer Based on Weakly Supervised in Crowd Localization[J]. Computer Engineering and Applications, 2023, 59(19): 92-98.
