Computer Engineering and Applications, 2025, Vol. 61, Issue 9: 221-229. DOI: 10.3778/j.issn.1002-8331.2409-0049

• Graphics and Image Processing •

Research on Real-Time Transformer for Multi-Scale Feature Optimization in Drone Aerial Imaging

XIANG Yiwei, JIANG Yu, WANG Qikai, LUO Rongrong   

  1. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
  • Online: 2025-05-01  Published: 2025-04-30

Abstract: To address the challenges of small target size, severe occlusion, and uneven sample distribution in UAV object detection, this paper proposes MSM-DETR, an enhanced version of the real-time detection Transformer (RT-DETR). A DSSF feature fusion structure is designed and incorporated into the neck network; it combines the dimension-aware selective integration (DASI) module and the scale sequence feature fusion (SSFF) module to enrich small-target information during the feature fusion stage, thereby improving detection accuracy. To handle severe occlusion and sample imbalance, a multi-core parallel scale fusion (MCPSF) module is introduced. It mitigates the inter-scale information imbalance caused by multi-kernel grouped convolutions through inter-scale fusion, providing the model with multi-scale receptive fields, while EMA attention further strengthens intra-group contextual information and improves detection performance. The Inner concept is incorporated into the original loss function, computing losses on auxiliary bounding boxes at different scales to accelerate convergence. Experimental results show that the improved model achieves mAP scores of 49.5% on the validation set and 38.9% on the test set of the VisDrone2019 dataset, improvements of 2.5 and 2.4 percentage points, respectively, over the original model.
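The abstract's MCPSF idea — splitting channels into groups, filtering each group at a different kernel size for multi-scale receptive fields, and fusing information between scales — can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names (`mean_filter`, `multi_kernel_scale_fusion`), the use of a mean filter as a stand-in for a learned k×k grouped convolution, and the carry-forward fusion rule are all illustrative assumptions.

```python
import numpy as np

def mean_filter(ch, k):
    """Naive k*k mean filter over one channel (stand-in for a k*k
    depthwise convolution); edge padding keeps the spatial size."""
    pad = k // 2
    p = np.pad(ch, pad, mode="edge")
    h, w = ch.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

def multi_kernel_scale_fusion(x, kernel_sizes=(1, 3, 5, 7)):
    """Hypothetical MCPSF-style sketch: split channels into groups,
    filter each group with a different kernel size (multi-scale
    receptive fields), and feed each group's output into the next
    group's input so information flows between scales before the
    groups are concatenated back together."""
    groups = np.array_split(x, len(kernel_sizes), axis=0)
    outs, carry = [], None
    for g, k in zip(groups, kernel_sizes):
        if carry is not None and carry.shape == g.shape:
            g = g + carry                      # inter-scale fusion step
        filtered = np.stack([mean_filter(ch, k) for ch in g])
        outs.append(filtered)
        carry = filtered
    return np.concatenate(outs, axis=0)        # same channel count as input

x = np.random.rand(8, 16, 16)                  # (channels, height, width)
y = multi_kernel_scale_fusion(x)
print(y.shape)  # -> (8, 16, 16)
```

The carry-forward addition is what distinguishes this from plain multi-kernel grouped convolution: without it, each channel group would see only its own kernel size, which is the inter-scale information imbalance the abstract says MCPSF is designed to correct.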
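The Inner idea in the loss function — computing overlap on auxiliary bounding boxes that share the original centers but are rescaled by a ratio — can be sketched as follows. The function name `inner_iou`, the center-size box format, and the default ratio of 0.75 are assumptions for illustration, not the paper's exact formulation.

```python
def inner_iou(box_a, box_b, ratio=0.75):
    """IoU of auxiliary boxes: each box keeps its center (cx, cy) but
    has width/height rescaled by `ratio`, changing the loss gradients
    for near-miss boxes and (per the abstract) speeding convergence."""
    def corners(box):
        cx, cy, w, h = box
        w, h = w * ratio, h * ratio            # auxiliary box at `ratio` scale
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)

    # Intersection of the two auxiliary boxes
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Identical boxes give IoU 1 at any ratio
print(inner_iou((10, 10, 4, 4), (10, 10, 4, 4)))  # -> 1.0
```

With `ratio < 1` the auxiliary boxes shrink, so partially overlapping predictions score lower than with plain IoU; a loss such as `1 - inner_iou(pred, target)` would then penalize loose matches more sharply, which is the convergence-acceleration mechanism the abstract describes.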

Key words: real-time detection Transformer (RT-DETR), aerial images, multi-scale, receptive field