Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (17): 123-135. DOI: 10.3778/j.issn.1002-8331.2501-0414

• Object Detection Special Section •


CNN-Transformer Multi-Scale Fusion Algorithm for Object Detection in Optical Remote Sensing Images

ZHENG Wenxuan, TAN Zhong, YANG Ying   

  1. School of Physics and Electronic Information Engineering, Jiangsu Second Normal University, Nanjing 210013, China
  2. School of Mathematical Sciences, Xiamen University, Xiamen, Fujian 361005, China
  • Online: 2025-09-01  Published: 2025-09-01



Abstract: To address the low detection accuracy caused by dense object distribution, variable scales, and insufficient feature information for small objects in optical remote sensing images, this paper proposes an LQ-Mixer-YOLOv8 model for object detection in optical remote sensing images. The model effectively integrates the complementary strengths of convolutional neural networks (CNN) and Transformers in extracting local (high-frequency) and global (low-frequency) feature information from images. To further enhance performance, the paper designs a DMulti-DWconv convolution module and an adaptive detail integration (ADI) module, and incorporates coordinate attention (CA) and squeeze-enhanced axial attention (SeaAttention) mechanisms to improve the network's feature extraction capability. A frequency ramp structure is employed to better balance the mix of local (high-frequency) and global (low-frequency) feature information. The weighted intersection over union (WIoU) loss function is combined with the normalized Wasserstein distance (NWD) small object detection metric to further improve detection accuracy for small objects in optical remote sensing images. Experimental results show that the LQ-Mixer-YOLOv8 model achieves average precisions of 96.3% and 96.6% on the test sets of the NWPU VHR-10 and DIOR datasets, respectively. On the NWPU VHR-10 dataset, compared with mainstream models such as Faster R-CNN, YOLOv3, YOLOv7S, YOLOv8S, Swin Transformer, and RT-DETR, the average precision (mAP@0.5) of LQ-Mixer-YOLOv8 is higher by 10.3, 6.0, 1.6, 2.1, 7.8, and 6.5 percentage points, respectively. On the DIOR dataset, compared with the same mainstream models, its average precision (mAP@0.5) is higher by 10.5, 7.3, 2.3, 2.7, 7.5, and 6.7 percentage points, respectively.
The method combines high detection accuracy with low computational complexity, making it well suited to object detection in optical remote sensing images.
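The abstract pairs the WIoU loss with the NWD metric for small objects. As a minimal sketch of the standard NWD formulation (each box is modeled as a 2-D Gaussian, and the 2-Wasserstein distance between the Gaussians is exponentially normalized), not the paper's implementation — the constant `c` and the blending weight `alpha` below are illustrative assumptions:

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two boxes given as (cx, cy, w, h).

    Each box is modeled as a 2-D Gaussian N((cx, cy), diag(w^2/4, h^2/4));
    c is a dataset-dependent normalizing constant (12.8 is an assumed value).
    Returns a similarity in (0, 1]; 1 means identical boxes.
    """
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    # Squared 2-Wasserstein distance between the two Gaussians
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def combined_box_loss(wiou_loss, box_pred, box_gt, alpha=0.5):
    """Hypothetical weighted blend of a precomputed WIoU loss term and an NWD term."""
    return alpha * wiou_loss + (1 - alpha) * (1 - nwd(box_pred, box_gt))
```

Unlike IoU, the NWD term stays smooth and informative even when two small boxes barely overlap, which is why such metrics are favored for tiny objects; the exact blending scheme used in the paper may differ.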

Key words: Transformer, optical remote sensing image, object detection, feature fusion