[1] CHENG G, XIE X X, HAN J W, et al. Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2020, 13: 3735-3756.
[2] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2014: 580-587.
[3] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[4] LIN T Y, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 936-944.
[5] HE K M, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2961-2969.
[6] CAI Z W, VASCONCELOS N. Cascade R-CNN: delving into high quality object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6154-6162.
[7] DAI J F, LI Y, HE K M, et al. R-FCN: object detection via region-based fully convolutional networks[C]//Advances in Neural Information Processing Systems, 2016: 379-387.
[8] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 779-788.
[9] 周沁坤, 周华平, 孙克雷, 等. ARST-YOLOv7: 用于航空遥感图像的小目标检测网络[J]. 计算机工程与应用, 2025, 61(12): 232-242.
ZHOU Q K, ZHOU H P, SUN K L, et al. ARST-YOLOv7: small target detection network for aerial remote sensing images[J]. Computer Engineering and Applications, 2025, 61(12): 232-242.
[10] 杨志渊, 罗亮, 吴天阳, 等. 改进YOLOv8的轻量级光学遥感图像船舶目标检测算法[J]. 计算机工程与应用, 2024, 60(16): 248-257.
YANG Z Y, LUO L, WU T Y, et al. Improved lightweight ship target detection algorithm for optical remote sensing images with YOLOv8[J]. Computer Engineering and Applications, 2024, 60(16): 248-257.
[11] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2016: 21-37.
[12] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2999-3007.
[13] ZHOU X Y, WANG D Q, KRÄHENBÜHL P, et al. Objects as points[J]. arXiv:1904.07850, 2019.
[14] TAN M X, PANG R M, LE Q V. EfficientDet: scalable and efficient object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10778-10787.
[15] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 213-229.
[16] BEAL J, KIM E, TZENG E, et al. Toward transformer-based object detection[J]. arXiv:2012.09958, 2020.
[17] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[18] PENG Z L, HUANG W, GU S Z, et al. Conformer: local features coupling global representations for visual recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 357-366.
[19] XU X K, FENG Z J, CAO C Q, et al. An improved swin transformer-based model for remote sensing object detection and instance segmentation[J]. Remote Sensing, 2021, 13(23): 4779.
[20] LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
[21] 赵其昌, 吴一全, 苑玉彬. 光学遥感图像舰船目标检测与识别方法研究进展[J]. 航空学报, 2024, 45(8): 51-84.
ZHAO Q C, WU Y Q, YUAN Y B. Research progress on detection and recognition methods of ship targets in optical remote sensing images[J]. Acta Aeronautica et Astronautica Sinica, 2024, 45(8): 51-84.
[22] HOU Q B, ZHOU D Q, FENG J S. Coordinate attention for efficient mobile network design[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 13708-13717.
[23] WAN Q, HUANG Z, LU J, et al. SeaFormer: squeeze-enhanced axial transformer for mobile semantic segmentation[J]. arXiv:2301.13156, 2023.
[24] WU H P, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 22-31.
[25] PENG Y, SONKA M, CHEN D Z. U-Net v2: rethinking the skip connections of U-Net for medical image segmentation[J]. arXiv:2311.17791, 2023.
[26] CHOLLET F. Xception: deep learning with depthwise separable convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1800-1807.
[27] CHENG G, ZHOU P C, HAN J W. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2016, 54(12): 7405-7415.
[28] LI K, WAN G, CHENG G, et al. Object detection in optical remote sensing images: a survey and a new benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 159: 296-307.
[29] TONG Z J, CHEN Y H, XU Z W, et al. Wise-IoU: bounding box regression loss with dynamic focusing mechanism[J]. arXiv:2301.10051, 2023.
[30] REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: a metric and a loss for bounding box regression[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 658-666.
[31] ZHENG Z H, WANG P, LIU W, et al. Distance-IoU loss: faster and better learning for bounding box regression[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 12993-13000.
[32] DANELLJAN M, KHAN F S, FELSBERG M, et al. Adaptive color attributes for real-time visual tracking[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014: 1090-1097.
[33] GEVORGYAN Z. SIoU loss: more powerful learning for bounding box regression[J]. arXiv:2205.12740, 2022.
[34] REDMON J, FARHADI A. YOLOv3: an incremental improvement[J]. arXiv:1804.02767, 2018.
[35] 苗茹, 岳明, 周珂, 等. 基于改进YOLOv7的遥感图像小目标检测方法[J]. 计算机工程与应用, 2024, 60(10): 246-255.
MIAO R, YUE M, ZHOU K, et al. Small target detection method in remote sensing images based on improved YOLOv7[J]. Computer Engineering and Applications, 2024, 60(10): 246-255.
[36] 张秀再, 沈涛, 许岱. 基于改进YOLOv8算法的遥感图像目标检测[J]. 激光与光电子学进展, 2024, 61(10): 1028001.
ZHANG X Z, SHEN T, XU D. Remote-sensing image object detection based on improved YOLOv8 algorithm[J]. Laser & Optoelectronics Progress, 2024, 61(10): 1028001.
[37] ZHAO Y, LYU W, XU S, et al. DETRs beat YOLOs on real-time object detection[J]. arXiv:2304.08069, 2023.
[38] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 618-626.