Research on Small Object Detection Method of Improved RT-DETR

doi:10.3778/j.issn.1002-8331.2501-0293

Abstract

Abstract: To address severe background interference and insufficient feature representation in small object detection in complex scenes, this paper proposes DA-DETR, an improved model based on RT-DETR. First, a multi-order gated aggregation block is introduced into the backbone network to sharpen the distinction between local and global features, which helps the detector separate foreground objects from noisy backgrounds. Second, a convolutional additive token mixer (CATM) is incorporated to reduce feature loss and improve the integration of global and local information. Finally, an improved loss function, CoreProximity-IoU, is designed to be more sensitive to IoU variations for small objects. Experimental results show that DA-DETR improves mAP@50 and mAP@50:95 by 2.8 and 2.3 percentage points, respectively, on the VisDrone2019 dataset, and by 0.6 and 0.4 percentage points, respectively, on the KITTI dataset compared with RT-DETR. The model also significantly reduces computational cost and parameter count, further validating the effectiveness and superiority of the proposed method.
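The abstract's point that IoU behaves differently for small objects can be illustrated with plain IoU. The sketch below is not the CoreProximity-IoU loss, which the abstract does not define. It only shows that the same localization error costs a small box far more IoU than a large one. The box sizes (16 and 128 pixels) and the 4-pixel shift are assumed values chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted(box, dx, dy):
    """Return the box translated by (dx, dy)."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Illustrative values only: a 16x16 "small" box and a 128x128 "large" box,
# each compared with a copy of itself shifted by the same 4 pixels.
small = (0, 0, 16, 16)
large = (0, 0, 128, 128)
shift = 4

iou_small = iou(small, shifted(small, shift, shift))
iou_large = iou(large, shifted(large, shift, shift))

print(f"small box IoU after {shift}px shift: {iou_small:.3f}")  # 0.391
print(f"large box IoU after {shift}px shift: {iou_large:.3f}")  # 0.884
```

Under these assumptions the small box falls below the 0.5 threshold used by mAP@50 while the large box stays well above it, which is why a regression loss tuned for small objects is a plausible design target.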

Keywords: small object detection, RT-DETR, complex scenes, background interference

摘要： 针对复杂场景小目标检测中存在的背景干扰严重、特征表达能力不足等问题，提出了一种基于改进RT-DETR的小目标检测模型DA-DETR。在骨干网络中引入了一种多阶门控聚合模块（multi-order gated aggregation block），通过增强局部与全局特征的差异性使目标检测器能更好地区分前景物体和嘈杂背景。引入了卷积加性标记混合器（convolutional additive token mixer，CATM），进一步减少了特征丢失，提升了模型的全局与局部信息整合能力。提出了一种改进的损失函数CoreProximity-IoU，其对于小目标检测的IoU变化更敏感。实验结果表明，DA-DETR模型在VisDrone2019数据集上的mAP@50和mAP@50：95分别提升了2.8和2.3个百分点，在KITTI数据集上的mAP@50和mAP@50：95分别比RT-DETR提升了0.6和0.4个百分点。此外，模型计算量和参数量均有显著的减少，进一步验证了所提出方法的有效性和优越性。

关键词: 小目标检测, RT-DETR, 复杂场景, 背景干扰

CHENG Xinmiao, ZHANG Xuesong, CAO Bingjie, SONG Cunli. Research on Small Object Detection Method of Improved RT-DETR[J]. Computer Engineering and Applications, 2025, 61(15): 144-155.

程鑫淼, 张雪松, 曹冰洁, 宋存利. 改进RT-DETR的小目标检测方法研究[J]. 计算机工程与应用, 2025, 61(15): 144-155.
