Computer Engineering and Applications, 2026, Vol. 62, Issue (4): 273-283. DOI: 10.3778/j.issn.1002-8331.2508-0255

• Graphics and Image Processing •

基于多模态融合网络的无人机小目标检测方法

姚继林1,刘宏哲1+,张铖1,2,路璐3   

    1.北京联合大学 北京市信息服务工程重点实验室,北京 100101
    2.北京强强源起科技有限公司,北京 100101
    3.北京安信创业信息科技发展有限公司,北京 100013
    + 通信作者 E-mail:liuhongzhe@buu.edu.cn
  • 收稿日期:2025-08-24 修回日期:2025-10-25 在线发布日期:2026-02-15 出版日期:2026-02-15
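  • Supported by:
    National Natural Science Foundation of China (62171042, U24A20331); Key Project of the State Language Commission (ZDI145-110); Beijing Key Science and Technology Project (KZ202211417048); Support Program for High-Level Scientific Research and Innovation Teams of Beijing Municipal Universities (BPHR20220121); Beijing Natural Science Foundation (4232026, 4242020); Academic Research Project of Beijing Union University (ZK20202514).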

UAV Small Target Detection Method Based on Multi-Modal Fusion Network

YAO Jilin1, LIU Hongzhe1+, ZHANG Cheng1,2, LU Lu3   

    1.Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
    2.Beijing Qiangqiang Yuanqi Technology Co., Ltd., Beijing 100101, China
    3.Beijing Anxin Entrepreneurship Information Technology Development Co., Ltd., Beijing 100013, China
    + Corresponding author E-mail:liuhongzhe@buu.edu.cn
  • Received:2025-08-24 Revised:2025-10-25 Online:2026-02-15 Published:2026-02-15

摘要: 针对无人机小目标检测在低光和复杂背景下精度低、误检率高的问题,单一模态检测方法难以取得理想效果。因此,提出了一种基于动态交互融合策略的小目标检测方法YOLOv8-DF。设计了双流特征提取网络,采用并行模式分别提取红外和可见光特征,并在Backbone部分引入感受野卷积(receptive field attention convolution,RFAConv),以增强模型的多尺度感知能力。为进一步捕捉全局上下文信息,提出了基于注意力机制的远程信息增强模块(remote information enhancement module,RIEM),并设计了动态特征交互模块(dynamic feature interaction module,DFIM),动态调整各模态优势信息权重,实现深度特征融合。实验结果表明,在DroneVehicle数据集上,YOLOv8-DF与单模态基准模型YOLOv8_RGB和YOLOv8_IR相比,mAP50分别提升了12.3和13.4个百分点,相较于PSFusion融合检测方法,mAP50提升了8.4个百分点,并在LLVIP公开数据集上进行了计算效率和泛化性实验,证明了所提出的方法具有较好的计算效率和泛化性能。

关键词: 无人机检测, 多模态, 小目标检测, 特征融合, YOLO

Abstract: Single-modality detection methods struggle to detect small targets from UAVs under low-light and complex-background conditions, where accuracy is low and the false detection rate is high. To address this problem, this paper proposes YOLOv8-DF, a small target detection method based on a dynamic interaction fusion strategy. A dual-stream feature extraction network extracts infrared and visible-light features in parallel, and receptive field attention convolution (RFAConv) is introduced into the backbone to enhance the model's multi-scale perception. To further capture global contextual information, an attention-based remote information enhancement module (RIEM) is proposed, and a dynamic feature interaction module (DFIM) is designed to dynamically adjust the weights of each modality's advantageous information, achieving deep feature fusion. Experimental results on the DroneVehicle dataset show that YOLOv8-DF improves mAP50 by 12.3 and 13.4 percentage points over the single-modality baselines YOLOv8_RGB and YOLOv8_IR, respectively, and by 8.4 percentage points over the PSFusion fusion-based detection method. Computational efficiency and generalization experiments on the public LLVIP dataset further indicate that the proposed method achieves good computational efficiency and generalization performance.

Key words: UAV detection, multimodal, small target detection, feature fusion, YOLO
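
The abstract describes DFIM as dynamically adjusting the weights of each modality's advantageous information before fusion. The sketch below illustrates only that general idea and is not the authors' DFIM: it assumes two feature maps of identical shape [C][H][W], derives per-channel weights from a softmax over channel means (a real network would use learned layers), and the names `global_avg` and `fuse_modalities` are hypothetical.

```python
import math

def global_avg(channel):
    """Mean of one 2-D channel given as a list of rows."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)

def fuse_modalities(rgb, ir):
    """Fuse two [C][H][W] feature maps with per-channel softmax weights.

    Hypothetical stand-in for a dynamic fusion step: the weights come from
    channel means here, not from learned parameters.
    """
    fused, weights = [], []
    for rgb_ch, ir_ch in zip(rgb, ir):
        s_rgb, s_ir = global_avg(rgb_ch), global_avg(ir_ch)
        m = max(s_rgb, s_ir)  # subtract the max for numerical stability
        e_rgb, e_ir = math.exp(s_rgb - m), math.exp(s_ir - m)
        w_rgb = e_rgb / (e_rgb + e_ir)
        w_ir = 1.0 - w_rgb
        weights.append((w_rgb, w_ir))
        # Weighted element-wise sum of the two modalities for this channel.
        fused.append([[w_rgb * a + w_ir * b for a, b in zip(r1, r2)]
                      for r1, r2 in zip(rgb_ch, ir_ch)])
    return fused, weights

# A dark visible-light channel (all zeros) and an active infrared channel (all ones).
rgb_feat = [[[0.0, 0.0], [0.0, 0.0]]]
ir_feat = [[[1.0, 1.0], [1.0, 1.0]]]
fused_feat, fusion_weights = fuse_modalities(rgb_feat, ir_feat)
```

In this toy input the infrared channel has the larger mean, so it receives the larger weight, which mirrors the low-light case where the visible-light stream carries little signal.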