Vehicle Detection of Multi-Modal Attention Fusion Under Different Illumination

doi:10.3778/j.issn.1002-8331.2305-0166

Abstract

To address the performance degradation that illumination changes cause in existing single-modal vehicle detection algorithms, this paper proposes YOLO-MMF, a multi-modal detection method that fuses infrared and visible-light images. The method builds an efficient dual-stream feature extraction network that extracts features from the visible-light and infrared images separately, and it replaces the bottleneck layers in the shallow CSP modules of YOLOv5 with a DenseBlock structure to strengthen feature extraction for small targets. It adopts a feature-level fusion mechanism that uses the discrete cosine transform to obtain high-frequency information, mitigating the loss of detail caused by average pooling, and combines this with a self-attention mechanism so that the network can capture the potential complementarity between modalities on its own, thereby significantly improving vehicle detection performance. Experimental results on the DroneVehicle dataset confirm the effectiveness of the method: compared with single-modal detection, the average detection accuracy improves by 14.4 and 10.8 percentage points, respectively, and the method is more robust under complex conditions such as illumination changes.
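The remark that average pooling loses detail which discrete cosine transform components can retain may be illustrated with a small sketch. It relies on a general property of the 2D DCT-II rather than on anything specific to YOLO-MMF: the lowest-frequency coefficient is the sum of the feature map, so global average pooling is that coefficient divided by the map size, and higher-frequency coefficients carry the information that pooling discards. The function names, the 4x4 toy feature maps, and the chosen frequency pairs are hypothetical and are not taken from the paper.

```python
import math

def dct2_coefficient(x, u, v):
    """Unnormalized 2D DCT-II coefficient at frequency (u, v) of an H x W map."""
    h, w = len(x), len(x[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            basis = (math.cos(math.pi * u * (i + 0.5) / h)
                     * math.cos(math.pi * v * (j + 0.5) / w))
            total += x[i][j] * basis
    return total

def global_average_pool(x):
    """Mean over all spatial positions of an H x W map."""
    h, w = len(x), len(x[0])
    return sum(sum(row) for row in x) / (h * w)

def frequency_descriptor(x, freqs):
    """One scalar per (u, v) pair; dividing by H*W makes (0, 0) equal the mean."""
    h, w = len(x), len(x[0])
    return [dct2_coefficient(x, u, v) / (h * w) for u, v in freqs]

# Two toy feature maps (assumed data) with the same mean but different detail.
flat = [[0.5] * 4 for _ in range(4)]
stripes = [[0.0, 1.0, 0.0, 1.0] for _ in range(4)]

# Assumed selection: the DC term plus one high horizontal frequency.
FREQS = [(0, 0), (0, 3)]

gap_flat = global_average_pool(flat)
gap_stripes = global_average_pool(stripes)
desc_flat = frequency_descriptor(flat, FREQS)
desc_stripes = frequency_descriptor(stripes, FREQS)

print("average pooling:", gap_flat, gap_stripes)
print("DCT descriptor (flat):   ", desc_flat)
print("DCT descriptor (stripes):", desc_stripes)
```

Both maps receive the same average-pooled value, so pooling alone cannot tell them apart. The (0, 3) component is close to zero for the flat map and clearly nonzero for the striped one, which is the kind of detail a frequency-based descriptor can pass on to a later attention stage.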

Key words: vehicle detection, multi-modal fusion, self-attention mechanism, discrete cosine transform

WANG Jiaqi, ZHANG Qi, HUANG Wei. Vehicle Detection of Multi-Modal Attention Fusion Under Different Illumination[J]. Computer Engineering and Applications, 2024, 60(16): 116-123.
