Pedestrian Intent Semantic VSLAM in Automatic Driving Scenarios

doi:10.3778/j.issn.1002-8331.2306-0159

Abstract

Visual simultaneous localization and mapping (VSLAM) is widely used in autonomous driving. However, conventional algorithms lack semantic information and cannot infer or predict the behavior or intentions of pedestrians in a scene. This paper proposes a semantic VSLAM method that uses a semantic segmentation algorithm based on the dense prediction transformer (DPT) to obtain segmentation masks of potentially dynamic targets and remove the dynamic features they cover. Because most dynamic objects in autonomous driving scenarios are pedestrians and vehicles, geometric constraints are combined with pedestrian intention prediction to jointly optimize the camera pose, so that static points on potentially dynamic targets can be added back and dynamic objects can be re-detected. To predict accurately whether a pedestrian intends to cross the road, a two-stream spatiotemporal adaptive graph convolutional neural network is built on human skeleton information. Experiments on the KITTI dataset show that the proposed method reduces the absolute trajectory error to some extent compared with ORB-SLAM3 and is more accurate than comparable algorithms. The method is expected to provide autonomous driving systems with richer semantic information and thereby better support autonomous driving tasks.

Key words: autonomous driving, semantic segmentation, camera pose optimization, pedestrian intention prediction
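
The dynamic feature removal step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `remove_dynamic_features`, the boolean mask layout, and the plain `(x, y)` keypoint tuples are all assumptions made for illustration. The sketch assumes a segmentation network has already produced a per-pixel mask of potentially dynamic classes (pedestrians, vehicles), and it simply discards keypoints that land on masked pixels.

```python
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float]  # (x, y) pixel coordinates; hypothetical layout


def remove_dynamic_features(
    keypoints: Sequence[Keypoint],
    dynamic_mask: Sequence[Sequence[bool]],
) -> List[Keypoint]:
    """Keep only keypoints that fall outside the potentially dynamic mask.

    dynamic_mask[row][col] is True where the segmentation labels the pixel
    as a potentially dynamic class. Keypoints outside the image are dropped.
    """
    height = len(dynamic_mask)
    width = len(dynamic_mask[0]) if height else 0
    static_points: List[Keypoint] = []
    for x, y in keypoints:
        # Map the sub-pixel keypoint to the nearest pixel of the mask.
        col, row = int(round(x)), int(round(y))
        inside_image = 0 <= row < height and 0 <= col < width
        if inside_image and not dynamic_mask[row][col]:
            static_points.append((x, y))
    return static_points


# Toy 3x4 mask: the top-right 2x2 block is marked as potentially dynamic.
mask = [
    [False, False, True, True],
    [False, False, True, True],
    [False, False, False, False],
]
points = [(0.0, 0.0), (2.2, 0.9), (3.0, 2.0), (5.0, 1.0)]
static = remove_dynamic_features(points, mask)
print(static)
```

In this toy case the second point lies on a masked pixel and the fourth lies outside the image, so only the first and third points remain. The method in the abstract goes further than this sketch: it adds static points on masked objects back using geometric constraints and pedestrian intention prediction, which the example above does not attempt.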

LUO Zhaoyang, ZHANG Rongfen, LIU Yuhong, LI Jin, FAN Runze. Pedestrian Intent Semantic VSLAM in Automatic Driving Scenarios[J]. Computer Engineering and Applications, 2024, 60(17): 107-116.
