InternDiffuseDet: Object Detection Method Combining Deformable Convolution and Diffusion Model

doi:10.3778/j.issn.1002-8331.2309-0272

Abstract

This paper addresses missed and false detections, limited feature extraction capability, and low detection accuracy in complex scenes in existing object detectors. Building on DiffusionDet, it proposes an object detection method that combines deformable convolution with a diffusion model. The core idea is that the model needs more, and higher-quality, feature maps before they enter the detection head. To this end, InternImage and the DCNv3 deformable convolution operator are introduced into the backbone network to enlarge the receptive field and strengthen the model's non-linear modeling capability. The intermediate feature pyramid network (FPN) is improved into CS-FPN, a feature pyramid based on selective weighting: depthwise separable convolutions separate channel and spatial information, and the CARAFE operator replaces conventional upsampling to improve resolution and the transfer of semantic information. The SGE attention mechanism then reassembles the feature maps so that they retain more hierarchical information during diffusion. Before the feature maps enter the detection head, a DDIM diffusion operation produces feature maps at different time steps, increasing the number of feature maps available for detection. Finally, the EIOU algorithm is adopted for target box matching and in the loss function to handle position offsets and scale differences between target boxes. Experiments on the COCO dataset and a road detection dataset show that, under the same experimental settings, the improved model outperforms the original model by 3.8 and 3.6 percentage points, respectively. These results suggest that the proposed method can improve the accuracy and robustness of object detection and offers a new approach to object detection in real-world scenes.

Key words: DiffusionDet, deformable convolution, diffusion model, feature pyramid, loss function
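
The abstract names EIOU as the box matching and regression criterion but does not state its formula. As a rough illustration only, the sketch below follows the commonly published EIoU definition: one minus the IoU, plus a normalized center-distance penalty, plus separate width and height penalties measured against the smallest enclosing box. The function name `eiou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions made for this example; this is not the authors' implementation.

```python
def eiou_loss(box_p, box_g, eps=1e-9):
    """EIoU loss between two axis-aligned boxes in (x1, y1, x2, y2) format.

    Hypothetical helper for illustration; name and box format are assumptions.
    """
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g

    # Width and height of the predicted and ground-truth boxes.
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # Position term: squared center distance over squared enclosing diagonal.
    dx = (px1 + px2) / 2 - (gx1 + gx2) / 2
    dy = (py1 + py2) / 2 - (gy1 + gy2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Scale terms: width and height gaps, each normalized separately.
    width_term = (pw - gw) ** 2 / (cw * cw + eps)
    height_term = (ph - gh) ** 2 / (ch * ch + eps)

    return 1.0 - iou + center_term + width_term + height_term


if __name__ == "__main__":
    # A pure shift activates only the IoU and center terms;
    # a pure stretch additionally activates the width term.
    print(eiou_loss((0, 0, 2, 2), (1, 0, 3, 2)))
    print(eiou_loss((0, 0, 2, 2), (0, 0, 4, 2)))
```

Penalizing width and height separately, rather than through a single aspect-ratio term, is what lets this loss respond to the scale differences mentioned in the abstract even when the two boxes share the same aspect ratio.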


YUAN Zhixiang, GAO Yongqi. InternDiffuseDet: Object Detection Method Combining Deformable Convolution and Diffusion Model[J]. Computer Engineering and Applications, 2024, 60(12): 203-215.

