Efficient Vehicle Detection in Remote Sensing Images with Bi-Directional Multi-Scale Feature Fusion

doi:10.3778/j.issn.1002-8331.2308-0386

Abstract

Abstract: Vehicle detection in remote sensing images is challenged by complex backgrounds, large scale variation, and the difficulty of detecting small targets. To address these challenges, this paper proposes GEM_YOLO, a detection method based on bidirectional multi-scale feature fusion. The method has three main parts. First, a global efficient attention module is designed as the feature extractor, providing lightweight and efficient feature extraction to handle object detection against complex backgrounds. Second, a bidirectional multi-scale feature fusion network is proposed as the feature fusion stage; it adopts top-down and bottom-up fusion strategies to promote information exchange between features at different levels. Finally, an attention-based dynamic detection head is applied as the predictor, which strengthens awareness of scale, spatial position, and task, further improving detection accuracy and robustness. In experiments on the public DIOR and DOTA datasets, the method reaches a mean average precision (mAP) of 92.4% and 81.4%, respectively, significantly outperforming other mainstream detection methods. It also has fewer parameters and a lower computational cost, providing an efficient solution for vehicle detection in remote sensing images.

Key words: remote sensing images, vehicle detection, multi-scale feature fusion, attention mechanism, dynamic detection head
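
The top-down and bottom-up fusion strategy named in the abstract can be illustrated with a minimal sketch. This is not the GEM_YOLO implementation: it assumes single-channel feature maps stored as nested lists, nearest-neighbour upsampling, 2x2 average pooling, and plain element-wise addition in place of the learned convolutions and attention a real network would use. The names `upsample2x`, `downsample2x`, `add`, and `bidirectional_fuse` are hypothetical and chosen only for illustration.

```python
from typing import List

Grid = List[List[float]]  # one single-channel feature map (illustrative only)


def upsample2x(x: Grid) -> Grid:
    """Nearest-neighbour upsampling by a factor of 2 along both axes."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def downsample2x(x: Grid) -> Grid:
    """2x2 average pooling with stride 2 (assumes even height and width)."""
    return [
        [
            (x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
            for j in range(0, len(x[0]), 2)
        ]
        for i in range(0, len(x), 2)
    ]


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two maps with identical shape."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def bidirectional_fuse(levels: List[Grid]) -> List[Grid]:
    """Fuse a pyramid; levels[0] is finest, each next level halves resolution."""
    # Top-down pass: push coarse, semantically strong features to finer levels.
    td = list(levels)
    for i in range(len(td) - 2, -1, -1):
        td[i] = add(td[i], upsample2x(td[i + 1]))
    # Bottom-up pass: push fine spatial detail back up to coarser levels.
    out = list(td)
    for i in range(1, len(out)):
        out[i] = add(out[i], downsample2x(out[i - 1]))
    return out


# Toy pyramid: 4x4, 2x2, and 1x1 maps filled with ones.
pyramid = [
    [[1.0] * 4 for _ in range(4)],
    [[1.0] * 2 for _ in range(2)],
    [[1.0]],
]
fused = bidirectional_fuse(pyramid)
```

After both passes every level has received information from every other level, which is the property the two-direction design is meant to provide; the resolution of each level is unchanged.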

QU Haicheng, WANG Meng, CHAI Rui. Efficient Vehicle Detection in Remote Sensing Images with Bi-Directional Multi-Scale Feature Fusion[J]. Computer Engineering and Applications, 2024, 60(12): 346-356.
