Survey of Camera Pose Estimation Methods Based on Deep Learning

doi:10.3778/j.issn.1002-8331.2209-0280

Abstract

Abstract: Camera pose estimation is the task of accurately estimating the six-degree-of-freedom (6-DoF) pose of a camera in the world coordinate system within a known environment. It is a key technology in robotics and autonomous driving. With the rapid development of deep learning, using deep learning to improve camera pose estimation algorithms has become an active research topic. To capture the current state and trends of research on camera pose estimation, this paper surveys the mainstream deep-learning-based algorithms. First, traditional feature-point-based camera pose estimation methods are briefly introduced. Then, deep-learning-based methods are discussed in depth: according to their core algorithms, they are described and analyzed in detail under six categories, namely end-to-end camera pose estimation, scene coordinate regression, retrieval-based camera pose estimation, hierarchical structures, multi-information fusion, and cross-scene camera pose estimation. Finally, the paper summarizes the current state of research, identifies the challenges facing the field on the basis of an in-depth performance analysis, and outlines future development trends.

Key words: deep learning, camera pose estimation, scene coordinate regression, multi-information fusion

WANG Jing, JIN Yuchu, GUO Ping, HU Shaoyi. Survey of Camera Pose Estimation Methods Based on Deep Learning[J]. Computer Engineering and Applications, 2023, 59(7): 1-14.
