Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (24): 251-260. DOI: 10.3778/j.issn.1002-8331.2409-0322

• Graphics and Image Processing •


Object Detection Method for Traffic Scenarios Based on Fusion of Event Frames and RGB Frames

HUANG Jiacai1+, CHANG Guowei1, HONG Ying2, GAO Fangzheng1, YANG Bo1, SHAO Liqi1   

  1. School of Automation, Nanjing Institute of Technology, Nanjing 211167, China
  2. Jinling Customs Technology Center, Nanjing 211106, China
  • Online: 2025-12-15  Published: 2025-12-15


Abstract: Multimodal fusion of event data and frame data is increasingly applied to object detection in traffic scenarios. However, the frame data used by most current methods comes from the grayscale images directly output by event cameras, which limits their ability to distinguish objects in complex environments such as traffic lights and signs. This paper proposes a multimodal dual-stream network that fuses RGB frames and event frames. The network integrates a feature fusion module, AddWithBAM, which combines channel and spatial attention, and introduces the fourth generation of deformable convolutional networks (DCNv4) into the C3 module to enhance the perception of complex geometric shapes and large scale variations while keeping the design lightweight. Experiments on the public PKU-DDD17-CAR and MVSEC datasets yield mAP values of 94.5% and 96.7%, respectively. Because no public traffic dataset contains both RGB frames and event frames, a new traffic dataset is constructed; ablation experiments on it show that the mAP improves to 96.4% while the number of model parameters is reduced by 9.1%. In real-world detection experiments, the average inference time is 13.7 ms. The proposed multimodal dual-stream architecture effectively improves object detection performance in complex traffic environments.
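Since the abstract gives only a high-level description of AddWithBAM, the sketch below shows one plausible reading in PyTorch: the RGB and event feature maps are summed element-wise, and the sum is refined by a BAM-style combination of channel and spatial attention (Park et al., BAM, 2018). Only the module name comes from the paper; the reduction ratio, dilation rates, and residual form are assumptions.

```python
# Hypothetical sketch of an AddWithBAM-style fusion block. Assumption: the two
# modality features are summed element-wise ("Add") and refined by a BAM-style
# joint channel/spatial attention; the paper's exact layout may differ.
import torch
import torch.nn as nn

class AddWithBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        # Channel attention: global average pooling + bottleneck MLP (as 1x1 convs).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: 1x1 reduce, two dilated 3x3 convs, 1x1 to one map.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.Conv2d(channels // reduction, channels // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
        )

    def forward(self, rgb_feat: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
        fused = rgb_feat + event_feat                       # element-wise "Add" step
        # Broadcast-sum the (N,C,1,1) channel map and the (N,1,H,W) spatial map.
        att = torch.sigmoid(self.channel_att(fused) + self.spatial_att(fused))
        return fused * (1.0 + att)                          # BAM residual refinement

if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)     # dummy RGB feature map
    y = torch.randn(1, 64, 80, 80)     # dummy event feature map
    print(AddWithBAM(64)(x, y).shape)  # torch.Size([1, 64, 80, 80])
```

The abstract also states that DCNv4 is introduced into the C3 module. The official DCNv4 operator ships as a separate CUDA extension, so the stand-in below wires torchvision's DCNv2-style DeformConv2d into a C3-style residual bottleneck purely to illustrate where the deformable convolution sits; it is not the paper's implementation.

```python
# Hypothetical stand-in: a C3-style residual bottleneck whose 3x3 conv is
# replaced by a deformable convolution. torchvision only provides the
# DCNv2-style op (DeformConv2d); the paper's DCNv4 needs its own CUDA kernel.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBottleneck(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # A plain conv predicts 2 offsets (dx, dy) for each kernel tap.
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        # Zero-init offsets so the layer starts as an ordinary convolution.
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)
        self.dcn = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sampling locations are shifted by the predicted offsets before the
        # convolution is applied; the residual connection mirrors the C3 block.
        return x + self.act(self.bn(self.dcn(x, self.offset(x))))
```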

Key words: event camera, object detection, multimodal feature fusion, attention mechanism