Deep Visual Odometry Based on Image Alignment and Uncertainty Estimation

doi:10.3778/j.issn.1002-8331.2105-0003

Abstract

Abstract: Deep visual odometry (DVO) methods use neural networks to directly estimate the depth of monocular images and the camera motion between adjacent images, greatly improving running speed while maintaining accuracy. However, they rely on the gray-scale invariance assumption, a strong assumption that is often violated in real scenes. To address this, a self-supervised direct visual odometry method based on image alignment (IA) is proposed, called AUDVO (aligned U-CNN deep VO). It uses an uncertainty estimation network (uncertainty CNN, U-CNN) to introduce uncertainty regularization terms as constraints, which makes the estimates more robust. To deal with holes caused by inaccurate estimation in large textureless areas, a super-resolution network is embedded in the depth estimation module for upsampling, in place of a simple interpolation operation. Experiments on the public KITTI dataset demonstrate the effectiveness of AUDVO for single-view depth estimation and camera pose estimation.

Key words: visual odometry, deep learning, uncertainty estimation network
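
The abstract does not state the exact form of the uncertainty regularization. One widely used formulation, assumed here purely for illustration, divides each pixel's photometric residual by a predicted scale sigma and adds a log(sigma) penalty, so the network cannot suppress every residual by predicting a large sigma everywhere. The function name, array shapes, and toy data below are hypothetical and are not taken from AUDVO.

```python
import numpy as np


def uncertainty_weighted_photometric_loss(target, warped, log_sigma):
    """Mean over pixels of |target - warped| / sigma + log(sigma).

    Assumed form (not confirmed by the abstract): a Laplacian-style
    heteroscedastic loss with sigma parameterized as exp(log_sigma).
    """
    residual = np.abs(target - warped)
    return float(np.mean(residual * np.exp(-log_sigma) + log_sigma))


# Toy 4x6 "images": the warped view matches the target except at one
# pixel, which stands in for a violation of gray-scale invariance.
rng = np.random.default_rng(0)
target = rng.random((4, 6))
warped = target.copy()
warped[1, 2] += 0.8

# sigma = 1 everywhere reduces the loss to the plain mean absolute error.
uniform_log_sigma = np.zeros_like(target)
loss_plain = uncertainty_weighted_photometric_loss(target, warped, uniform_log_sigma)

# Raising sigma only at the violating pixel down-weights its residual;
# sigma equal to the residual minimizes r / sigma + log(sigma) there.
adaptive_log_sigma = uniform_log_sigma.copy()
adaptive_log_sigma[1, 2] = np.log(0.8)
loss_adaptive = uncertainty_weighted_photometric_loss(target, warped, adaptive_log_sigma)

print(loss_plain, loss_adaptive)
```

In this toy case the adaptive sigma yields a lower loss than the uniform one, which is the sense in which such a term can make alignment more tolerant of pixels that break the gray-scale invariance assumption.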

QIN Chao (秦超), YAN Zifei (闫子飞). Deep Visual Odometry Based on Image Alignment and Uncertainty Estimation[J]. Computer Engineering and Applications, 2022, 58(22): 101-107.
