Semantic-Guidance Multi-scale Network for Multi-view Stereo

doi:10.3778/j.issn.1002-8331.2009-0061

Abstract

Abstract: Current deep-learning-based multi-view depth estimation methods can be roughly divided into two categories according to the type of convolutional neural network they use. Models based on 2D convolutional networks predict quickly but with lower accuracy, while models based on 3D convolutional networks achieve higher accuracy at the cost of greater hardware consumption. In addition, changes in the camera's extrinsic parameters across views prevent models from producing high-precision predictions at object edges, in occluded regions, or in weakly textured areas. To address these problems, this paper proposes a semantic-guided multi-scale multi-view depth estimation algorithm based on 3D convolution, which reduces hardware consumption while maintaining prediction accuracy. For occluded or weakly textured regions, the image features extracted by the network itself serve as prior guidance information to strengthen the network's perception of global information, and a multi-scale fusion method is used to improve the network's robustness. In comparative tests on public datasets, the proposed method predicts clearer depth maps and effectively handles difficult regions such as object boundaries and occlusions.

Key words: multi-view stereo, depth estimation, deep neural network, supervised learning

摘要： 目前利用深度学习进行多视图深度估计的方法根据卷积类型可以大致分为两类。其中，基于2D卷积网络的模型预测计算速度快，但预测精度较低；基于3D卷积网络的模型预测精度高，却存在高硬件消耗。同时，多视图中相机外部参数的变化使得模型无法在物体边缘、遮挡或纹理较弱区域生成高精度预测结果。针对上述问题，提出了基于3D卷积的语义导向多尺度多视图深度估计算法，在保证预测精度的同时降低硬件消耗。同时针对遮挡、纹理较弱等区域，利用网络自身提取的图片特征作为先验导向信息，增强网络对全局信息的感知，结合多尺度融合方法增强网络的鲁棒性。在公开数据集的测试对比中，提出的方法预测深度图结果更加清晰，并能有效地应对图片中物体边界、遮挡等区域。

关键词: 多视图立体匹配, 深度估计, 深度神经网络, 监督学习

YUN Jingyang, LI Xuehua, XIANG Wei. Semantic-Guidance Multi-scale Network for Multi-view Stereo[J]. Computer Engineering and Applications, 2022, 58(2): 215-224.

贠璟扬, 李学华, 向维. 语义导向多尺度多视图深度估计算法[J]. 计算机工程与应用, 2022, 58(2): 215-224.
