Survey of Transformer-Based Object Detection Algorithms

doi:10.3778/j.issn.1002-8331.2211-0133

Abstract

The Transformer is a deep learning architecture with strong modeling and parallel computing capabilities, and Transformer-based object detection has become an active research topic. To explore new ideas and directions for object detection, this paper surveys existing Transformer-based object detection algorithms, together with a range of object detection datasets and their application scenarios. The algorithms are reviewed from four aspects: feature extraction, object estimation, label matching strategy, and applications. They are then compared with object detection algorithms based on convolutional neural networks (CNNs), the advantages and limitations of Transformers in object detection are analyzed, and a general framework for Transformer-based object detection models is proposed. Finally, development trends of Transformers in object detection are discussed.

Keywords: Transformer, image processing, object detection, deep learning, convolutional neural network (CNN)

LI Jian, DU Jianqiang, ZHU Yanchen, GUO Yongkun. Survey of Transformer-Based Object Detection Algorithms[J]. Computer Engineering and Applications, 2023, 59(10): 48-64.
