Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (24): 216-227.DOI: 10.3778/j.issn.1002-8331.2409-0328

• Graphics and Image Processing •

Multispectral Object Detection Based on Cross-Modality Adaptive Fusion Network

ZHENG Shangpo1, LIU Junfeng1, ZENG Jun2+, XU Shikang1, LIAO Dingding1   

  1. School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
  2. School of Electric Power Engineering, South China University of Technology, Guangzhou 510641, China
  • Online: 2025-12-15   Published: 2025-12-15

Abstract: Multispectral object detection integrates images from different spectral modalities, such as RGB and thermal infrared images, to detect objects accurately in complex environments such as low-visibility conditions. In existing methods, however, the features of one modality often dominate the modality interaction process, so crucial features of the other modality are overlooked or lost, and neither the complementary information between modalities nor the specific details inherent to each modality is fully exploited. In addition, most methods adopt relatively simple feature fusion strategies, which prevents the model from effectively distinguishing and integrating the key features of each modality and thus limits gains in detection accuracy. To address these challenges, a cross-modality adaptive fusion network (CAFNet) is proposed, comprising a cross-modality interactive Transformer (CMIT) module, a multimodal adaptive weighted fusion (MAWF) module, and a 3D attention feature enhancement (3D-AFE) module. The CMIT module, built on the Transformer architecture, mines and exploits complementary feature information across modalities and detailed feature information within each modality. The MAWF module performs efficient adaptive fusion of the multimodal features, and the 3D-AFE module strengthens the model's holistic perception of key features by incorporating non-local and SimAM attention mechanisms. Extensive experiments and ablation studies on the FLIR, LLVIP, and VEDAI multispectral datasets confirm the effectiveness of the proposed method, whose detection performance surpasses that of several current mainstream methods.
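
To make the three modules concrete, below is a minimal PyTorch sketch of how such components are commonly built. The class names (SimAM, AdaptiveWeightedFusion, CrossModalInteraction), the pooled-softmax gating of the fusion step, and the residual bidirectional cross-attention layout are illustrative assumptions, not the authors' CAFNet implementation; only the SimAM energy formulation follows its published definition (Yang et al., ICML 2021).

```python
import torch
import torch.nn as nn


class SimAM(nn.Module):
    """Parameter-free SimAM attention (Yang et al., ICML 2021):
    weights each neuron by an energy measure of its distinctiveness."""

    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        n = x.shape[2] * x.shape[3] - 1
        # squared deviation of each position from its channel mean
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        # channel-wise variance estimate
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # inverse energy: larger for more distinctive neurons
        e_inv = d / (4 * (v + self.eps)) + 0.5
        return x * torch.sigmoid(e_inv)


class AdaptiveWeightedFusion(nn.Module):
    """Hypothetical MAWF-style fusion: a learned gate pools both feature
    maps and produces one softmax weight per modality, so the fused map
    is a convex combination of the RGB and thermal branches."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),  # one logit per modality
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(torch.cat([rgb, thermal], dim=1)), dim=1)
        return w[:, 0:1] * rgb + w[:, 1:2] * thermal  # (B, C, H, W)


class CrossModalInteraction(nn.Module):
    """Hypothetical CMIT-style block: each modality attends to the other
    (queries from one stream, keys/values from the other) with residual
    connections, so complementary features flow in both directions."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.rgb_from_ir = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        b, c, h, w = rgb.shape
        r = rgb.flatten(2).transpose(1, 2)      # (B, HW, C) token sequences
        t = thermal.flatten(2).transpose(1, 2)
        r2, _ = self.rgb_from_ir(r, t, t)       # RGB queries thermal features
        t2, _ = self.ir_from_rgb(t, r, r)       # thermal queries RGB features
        rgb = rgb + r2.transpose(1, 2).reshape(b, c, h, w)
        thermal = thermal + t2.transpose(1, 2).reshape(b, c, h, w)
        return rgb, thermal


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 40, 40)
    ir = torch.randn(2, 64, 40, 40)
    rgb, ir = CrossModalInteraction(64)(rgb, ir)   # cross-modal interaction
    fused = AdaptiveWeightedFusion(64)(rgb, ir)    # adaptive weighted fusion
    out = SimAM()(fused)                           # attention-based enhancement
    print(out.shape)  # torch.Size([2, 64, 40, 40])
```

In this sketch the fusion weights form a convex combination, so neither modality can silently suppress the other unless the gate learns to favor it; the actual MAWF, CMIT, and 3D-AFE designs in the paper may differ in detail.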

Key words: attention mechanisms, Transformer, multispectral object detection, cross-modality, adaptive feature fusion, feature enhancement
