Computer Engineering and Applications, 2024, Vol. 60, Issue (19): 178-189. DOI: 10.3778/j.issn.1002-8331.2307-0084

• Graphics and Image Processing •

Monocular 3D Object Detection for Autonomous Driving Based on Contextual Transformer

SHE Xiangyang (厍向阳), YAN Weijia (颜唯佳), DONG Lihong (董立红)

  1. College of Computer Science and Technology, Xi’an University of Science and Technology, Xi’an 710054, China
  • Online: 2024-10-01 Published: 2024-09-30

Abstract: To address missed detections and poor multi-scale detection performance in current monocular 3D object detection, a monocular 3D object detection algorithm for autonomous driving based on Contextual Transformer (CM-RTM3D) is proposed. First, Contextual Transformer (CoT) is introduced into the ResNet-50 network to build a ResNet-Transformer architecture for feature extraction. Second, a multi-scale spatial perception (MSP) module is designed: it mitigates the loss of shallow features through scale-space response operations, embeds a coordinate attention (CA) mechanism along the horizontal and vertical spatial directions, and uses the softmax function to generate soft importance weights for each scale. Finally, the Huber loss function replaces the L1 loss function in the offset loss. Experimental results on the KITTI autonomous driving dataset show that, compared with the RTM3D algorithm, the proposed algorithm improves AP3D by 4.84, 3.82, and 5.36 percentage points and APBEV by 4.75, 6.26, and 3.56 percentage points at the easy, moderate, and hard difficulty levels, respectively.

Key words: autonomous driving, monocular 3D object detection, Contextual Transformer, multi-scale perception, coordinate attention mechanism
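
The abstract states that the Huber loss replaces the L1 loss in the offset loss but gives no formula. The sketch below shows the standard textbook Huber loss next to the L1 loss for comparison. It is an illustration only, not the paper's implementation: the threshold `delta=1.0`, the averaging over coordinates, and the function names `l1_loss`, `huber_loss`, and `mean_offset_loss` are all assumptions.

```python
def l1_loss(residual):
    """Absolute error: constant gradient magnitude everywhere."""
    return abs(residual)


def huber_loss(residual, delta=1.0):
    """Standard Huber loss; delta is an assumed threshold, not from the paper.

    Quadratic for |residual| <= delta, linear beyond it.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mean_offset_loss(pred, target, delta=1.0):
    """Average Huber loss over all (x, y) offset coordinates.

    pred and target are equal-length lists of (x, y) tuples; averaging
    over coordinates is an assumption made for this sketch.
    """
    total = 0.0
    count = 0
    for (px, py), (tx, ty) in zip(pred, target):
        total += huber_loss(px - tx, delta) + huber_loss(py - ty, delta)
        count += 2
    return total / count


if __name__ == "__main__":
    for r in (0.1, 0.5, 3.0):
        print(r, l1_loss(r), huber_loss(r))
```

For small residuals the Huber loss is quadratic, so its gradient shrinks smoothly toward zero instead of staying at a constant magnitude as with L1; for large residuals it grows linearly, which keeps it less sensitive to outliers than a purely squared error.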