Review of Stereo Image Disparity Estimation Methods Based on Deep Learning

doi:10.3778/j.issn.1002-8331.2204-0382

Abstract

3D reconstruction is widely used in autonomous driving, robotics, drones, and augmented reality, and disparity estimation is one of its key steps. With the growth of datasets and advances in hardware and network models, deep learning models for disparity estimation have become widely used and achieve good results. However, these methods are mostly applied to outdoor scenes and are rarely used on indoor-scene datasets. This paper reviews deep learning methods for binocular disparity estimation and selects five representative networks: PSMNet (pyramid stereo matching network), GA-Net (guided aggregation network), LEAStereo (hierarchical neural architecture search for deep stereo matching), DeepPruner (learning efficient stereo matching via differentiable patchmatch), and BGNet (bilateral grid learning for stereo matching networks). The networks are applied to one real-world street-view dataset (KITTI2015) and two indoor-scene datasets (Middlebury2014 and InStereo2K). The construction of each model is analyzed, the performance of deep learning in disparity estimation for indoor-scene images is evaluated, and the results are compared with the traditional SGM method. Finally, based on the surveyed research on deep learning disparity estimation, the problems and challenges it faces are pointed out.

Key words: disparity estimation, deep learning, indoor image, convolutional neural networks

WANG Daolei, XIAO Jiawei, LI Jiankang, ZHU Rui. Review of Stereo Image Disparity Estimation Methods Based on Depth Learning[J]. Computer Engineering and Applications, 2022, 58(20): 16-27.
