Object Detection Algorithms Based on Deep Learning and Transformer

doi:10.3778/j.issn.1002-8331.2205-0354

Abstract

Abstract: Object detection is the basis for higher-level vision tasks such as object tracking and instance segmentation, and has important applications in real-world scenarios such as intelligent transportation, defect detection, and intelligent security. Existing high-precision detection algorithms are all built on deep learning and typically rely on anchor boxes. However, the shortcomings of anchor boxes have a considerable impact on detector performance, and anchor-free detection has become a new research direction in object detection in recent years. At the same time, the great potential shown by Transformer has opened up a new direction for the vision field, namely combining images with Transformer, and Transformer-based object detection has also become a new research hotspot. This paper systematically summarizes the object detection algorithms of the deep learning era, surveys papers on object detection from the past five years, analyzes these algorithms in depth from the two perspectives of Anchor-free and Transformer, and introduces their specific applications in real-world scenarios as well as the datasets commonly used in object detection. Finally, based on the current state of research, future research directions for object detection are discussed.

Key words: computer vision, object detection, anchor-free detection, Transformer

FU Miaomiao, DENG Miaolei, ZHANG Dexian. Object Detection Algorithms Based on Deep Learning and Transformer[J]. Computer Engineering and Applications, 2023, 59(1): 37-48.
