Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (16): 116-123. DOI: 10.3778/j.issn.1002-8331.2305-0166

• Pattern Recognition and Artificial Intelligence •


Vehicle Detection via Multi-Modal Attention Fusion Under Different Illumination

WANG Jiaqi, ZHANG Qi, HUANG Wei   

  1. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
    2. Hubei Key Laboratory of Intelligent Robotics, Wuhan 430205, China
    3. School of Electrical Engineering, Wuhan Institute of Technology, Wuhan 430205, China
  • Online: 2024-08-15  Published: 2024-08-15


Abstract: To address the performance degradation that illumination changes cause in existing single-modal vehicle detection algorithms, a multi-modal detection method named YOLO-MMF, which fuses infrared and visible-light imagery, is proposed. The method builds an efficient dual-stream feature extraction network that extracts features from visible-light and infrared images separately, replacing the bottleneck layers of the shallow CSP modules in YOLOv5 with a DenseBlock structure to strengthen feature extraction for small targets. It then adopts a feature-level fusion mechanism: the discrete cosine transform is used to obtain high-frequency information, mitigating the loss of detail caused by average pooling, and is combined with a self-attention mechanism so that the network can spontaneously capture the latent complementarity between modalities, significantly improving vehicle detection performance. Experimental results on the DroneVehicle dataset confirm the effectiveness of the method: compared with single-modal detection, average detection accuracy improves by 14.4 and 10.8 percentage points respectively, and the method remains robust under complex conditions such as illumination changes.
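As a concrete illustration of the backbone change described above, the following is a minimal PyTorch sketch of a CSP-style block whose inner bottleneck layers are replaced by DenseNet-style dense layers. The class names, channel split, growth rate, and layer count are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet-style layer: its output is concatenated onto its input."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, growth, 3, padding=1, bias=False),
            nn.BatchNorm2d(growth),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

class DenseBlockCSP(nn.Module):
    """CSP-style block with dense layers in place of the usual bottlenecks
    (hypothetical stand-in for the paper's modified shallow CSP module)."""
    def __init__(self, ch, n_layers=3, growth=32):
        super().__init__()
        half = ch // 2
        self.split1 = nn.Conv2d(ch, half, 1, bias=False)  # dense branch
        self.split2 = nn.Conv2d(ch, half, 1, bias=False)  # shortcut branch
        layers, c = [], half
        for _ in range(n_layers):
            layers.append(DenseLayer(c, growth))
            c += growth  # each dense layer adds `growth` channels
        self.dense = nn.Sequential(*layers)
        self.fuse = nn.Conv2d(c + half, ch, 1, bias=False)  # back to `ch`

    def forward(self, x):
        return self.fuse(torch.cat([self.dense(self.split1(x)),
                                    self.split2(x)], dim=1))
```

With these defaults, DenseBlockCSP(128)(torch.randn(1, 128, 80, 80)) returns a tensor of the same shape, so the block can drop into an existing backbone stage unchanged.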
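On the fusion side, the sketch below pairs DCT-based channel pooling, in the spirit of frequency channel attention, with a self-attention layer over tokens from both modalities, matching the abstract's description of replacing average pooling with DCT-derived high-frequency statistics. The chosen frequency pairs, head count, and module layout are assumptions for illustration, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn

def dct_basis(h, w, u, v):
    """2D discrete-cosine basis of frequency (u, v) on an h x w grid."""
    xs = torch.arange(h, dtype=torch.float32)
    ys = torch.arange(w, dtype=torch.float32)
    bx = torch.cos(math.pi * u * (xs + 0.5) / h)
    by = torch.cos(math.pi * v * (ys + 0.5) / w)
    return bx[:, None] * by[None, :]

class DCTChannelAttention(nn.Module):
    """Channel attention pooled with fixed DCT bases instead of average
    pooling, so high-frequency detail also shapes the channel weights."""
    def __init__(self, ch, size, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        self.register_buffer(
            "basis", torch.stack([dct_basis(size, size, u, v) for u, v in freqs]))
        self.mlp = nn.Sequential(
            nn.Linear(ch * len(freqs), ch // 4), nn.ReLU(inplace=True),
            nn.Linear(ch // 4, ch), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        # Project each channel onto every DCT basis: (B, C, F) coefficients.
        coeff = torch.einsum("bchw,fhw->bcf", x, self.basis)
        return x * self.mlp(coeff.reshape(b, -1)).view(b, c, 1, 1)

class CrossModalFusion(nn.Module):
    """Applies DCT channel attention to each stream, then lets self-attention
    over the concatenated RGB+IR tokens capture cross-modal complementarity."""
    def __init__(self, ch, size, heads=4):
        super().__init__()
        self.ca_rgb = DCTChannelAttention(ch, size)
        self.ca_ir = DCTChannelAttention(ch, size)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.proj = nn.Conv2d(2 * ch, ch, 1, bias=False)

    def forward(self, rgb, ir):
        b, c, h, w = rgb.shape
        tokens = torch.cat([self.ca_rgb(rgb).flatten(2).transpose(1, 2),
                            self.ca_ir(ir).flatten(2).transpose(1, 2)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # (B, 2HW, C)
        rgb_f, ir_f = fused.split(h * w, dim=1)
        return self.proj(torch.cat(
            [rgb_f.transpose(1, 2).reshape(b, c, h, w),
             ir_f.transpose(1, 2).reshape(b, c, h, w)], dim=1))
```

For example, CrossModalFusion(128, size=40)(rgb, ir) with two (1, 128, 40, 40) inputs yields one fused (1, 128, 40, 40) map. Note that self-attention over the 2·H·W token sequence is quadratic in token count, so a module like this fits best at a downsampled backbone stage.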

Key words: vehicle detection, multi-modal fusion, self-attention mechanism, discrete cosine transform