Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (19): 92-98. DOI: 10.3778/j.issn.1002-8331.2206-0199

• Pattern Recognition and Artificial Intelligence •

Application of an Improved Transformer Based on Weak Supervision in Crowd Localization

GAO Hui, DENG Miaolei, ZHAO Wenjun, CHEN Faquan, ZHANG Dexian   

  1. School of Mechanical and Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China
  2. Henan International Joint Laboratory of Grain Information Processing, Zhengzhou 450001, China
  3. School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
  • Online: 2023-10-01 Published: 2023-10-01

基于弱监督的改进Transformer在人群定位中的应用

高辉,邓淼磊,赵文君,陈法权,张德贤   

  1. 河南工业大学 机电工程学院,郑州 450001
  2. 河南省粮食信息处理国际联合实验室,郑州 450001
  3. 河南工业大学 信息科学与工程学院,郑州 450001

Abstract: Existing crowd localization methods rely on pseudo bounding boxes or pre-designed localization maps and therefore need complex preprocessing and post-processing to obtain head positions. To address this, LocalFormer, a weakly supervised end-to-end crowd localization network, is proposed. In the feature extraction stage, a pure Transformer serves as the backbone network, and global max pooling is applied to the features of each stage to extract richer detail about human heads. In the encoder-decoder stage, positional information is embedded into the aggregated features, which form the encoder input. Each decoder layer uses a set of trainable embeddings as queries and takes the visual features from the last encoder layer as keys and values; the decoded features are then used to predict confidence scores. Finally, a binarization module with an adaptively optimized threshold learner binarizes the confidence maps precisely. Experiments on three datasets covering different environments show that the proposed method achieves the best localization performance.
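
The data flow described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy tensors, and the linear-plus-sigmoid prediction head are assumptions made for illustration; the attention is single-head with no learned projections; and the threshold is a fixed number, whereas the abstract describes a learned, adaptively optimized one.

```python
import math

def global_max_pool(feature_map):
    """Reduce a (channels, height, width) nested list to one max value per channel."""
    return [max(max(row) for row in channel) for channel in feature_map]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Single-head attention in which `memory` serves as both keys and values.

    Simplification (assumption): no learned projections, heads, or residuals.
    """
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    decoded = []
    for query in queries:
        # Scaled dot product of this query with every encoder token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in memory]
        weights = softmax(scores)
        # Weighted sum of the encoder tokens (the values).
        decoded.append(
            [sum(w * value[d] for w, value in zip(weights, memory)) for d in range(dim)]
        )
    return decoded

def confidence_scores(decoded, weight, bias):
    """Hypothetical prediction head: a linear layer followed by a sigmoid."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weight, feat)) + bias)))
        for feat in decoded
    ]

def binarize(scores, threshold):
    """Map each confidence to 1 (head) or 0 (background) using a fixed threshold."""
    return [1 if s >= threshold else 0 for s in scores]

# Toy usage: 3 encoder tokens of dimension 2 and 2 query embeddings.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in for the last encoder layer
queries = [[2.0, 0.0], [0.0, 2.0]]              # stand-ins for trainable embeddings
decoded = cross_attention(queries, memory)
scores = confidence_scores(decoded, weight=[1.0, -1.0], bias=0.0)
heads = binarize(scores, threshold=0.5)
```

In a trained model the queries, the prediction head, and the threshold would all be learned parameters; here they are hard-coded only to make the sketch runnable.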

Key words: crowd localization, weakly supervised, convolutional neural network (CNN), global max pooling (GMP), vision Transformer (ViT)

摘要: 针对现有人群定位方法采用伪边界框或预先设计的定位图,需要复杂的预处理和后处理来获得头部位置的问题,提出一种基于弱监督的端到端人群定位网络LocalFormer。在特征提取阶段,将纯Transformer作为骨干网络,并对每个阶段的特征执行全局最大池化操作,提取更加丰富的人头细节信息。在编码器-解码器阶段,将聚合特征嵌入位置信息作为编码器的输入,且每个解码器层采用一组可训练嵌入作为查询,并将编码器最后一层的视觉特征作为键和值,解码后的特征用于预测置信度得分。通过二值化模块自适应优化阈值学习器,从而精确地二值化置信度图。在不同数据环境下对三个数据集进行实验,结果表明该方法实现了最佳定位性能。

关键词: 人群定位, 弱监督, 卷积神经网络, 全局最大池化, 视觉Transformer