Monocular 3D Object Detection for Autonomous Driving Based on Contextual Transformer

doi:10.3778/j.issn.1002-8331.2307-0084

Abstract: To address missed detections and poor multi-scale detection performance in current monocular 3D object detection, this paper proposes CM-RTM3D, a monocular 3D object detection algorithm for autonomous driving based on the Contextual Transformer. First, the Contextual Transformer (CoT) is introduced into the ResNet-50 network to build a ResNet-Transformer architecture for feature extraction. Second, a multi-scale spatial perception (MSP) module is designed: scale-space response operations reduce the loss of shallow features, a coordinate attention (CA) mechanism is embedded along the horizontal and vertical spatial directions, and a softmax function generates soft importance weights for each scale. Finally, the Huber loss replaces the L1 loss in the offset loss. Experiments on the KITTI autonomous driving dataset show that, compared with the RTM3D algorithm, the proposed algorithm improves AP3D by 4.84, 3.82, and 5.36 percentage points and APBEV by 4.75, 6.26, and 3.56 percentage points at the easy, moderate, and hard difficulty levels, respectively.

Key words: autonomous driving, monocular 3D object detection, Contextual Transformer, multi-scale perception, coordinate attention mechanism
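
The abstract states that the Huber loss replaces the L1 loss in the offset loss. As a rough illustration of why that swap can matter, the sketch below compares the two losses on a few offset values. It is not the authors' implementation: the threshold `delta=1.0`, the function names, and the sample offsets are assumptions made only for this example.

```python
def l1_loss(pred, target):
    """Mean absolute error over paired offset values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it.

    delta=1.0 is an assumed threshold, not a value taken from the paper.
    """
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


# Hypothetical offsets: two small errors and one large outlier.
pred_offsets = [0.30, 0.10, 3.00]
true_offsets = [0.25, 0.20, 0.50]

print("L1:   ", l1_loss(pred_offsets, true_offsets))
print("Huber:", huber_loss(pred_offsets, true_offsets))
```

For small errors the Huber loss is quadratic, so its gradient shrinks smoothly toward zero near the target, while for large errors it grows linearly like L1 and stays robust to outliers.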

SHE Xiangyang, YAN Weijia, DONG Lihong. Monocular 3D Object Detection for Autonomous Driving Based on Contextual Transformer[J]. Computer Engineering and Applications, 2024, 60(19): 178-189.
