Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (2): 215-224. DOI: 10.3778/j.issn.1002-8331.2009-0061

• Pattern Recognition and Artificial Intelligence •

Semantic-Guidance Multi-scale Network for Multi-view Stereo

YUN Jingyang, LI Xuehua, XIANG Wei   

  1. School of Information and Communication Engineering, Beijing Information Science and Technology University, Beijing 100101, China
  2. College of Science and Engineering, James Cook University, Cairns, Queensland 4878, Commonwealth of Australia
  • Online: 2022-01-15 Published: 2022-01-18




Abstract: Current deep-learning-based multi-view depth estimation methods fall roughly into two categories according to the type of convolution they use. Models built on 2D convolutional networks predict quickly but with lower accuracy, whereas models built on 3D convolutional networks achieve higher accuracy at the cost of heavy hardware consumption. In addition, variation in the camera extrinsic parameters across views prevents these models from producing high-precision predictions at object edges, in occluded regions, and in weakly textured areas. To address these problems, this paper proposes a semantic-guided, multi-scale multi-view depth estimation algorithm based on 3D convolution that reduces hardware consumption while maintaining prediction accuracy. For occluded and weakly textured regions, the image features extracted by the network itself serve as prior guidance that strengthens the network's perception of global information, and a multi-scale fusion method is combined with this guidance to improve robustness. In comparative tests on public datasets, the proposed method predicts clearer depth maps and effectively handles difficult regions such as object boundaries and occlusions.

Key words: multi-view stereo, depth estimation, deep neural network, supervised learning
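
The hardware trade-off mentioned in the abstract can be made concrete with a rough sketch. The code below is not the authors' network. It only estimates the memory of a dense cost volume of the kind 3D convolutions operate on, and shows soft-argmin depth regression, a step commonly used in 3D-convolutional multi-view stereo. The image size, channel count, number of depth planes, and the coarse-to-fine split are all hypothetical values chosen for illustration, and coarse-to-fine refinement is only one common multi-scale strategy, not something the abstract confirms.

```python
import math


def cost_volume_bytes(height, width, depth_planes, channels, bytes_per_value=4):
    """Memory needed for a dense H x W x D x C cost volume (float32 by default)."""
    return height * width * depth_planes * channels * bytes_per_value


def soft_argmin_depth(costs, depth_values):
    """Expected depth under a softmax taken over negated matching costs."""
    # Shift by the minimum cost so the exponentials stay numerically stable.
    lowest = min(costs)
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(w / total * d for w, d in zip(weights, depth_values))


# Hypothetical single-scale volume: full resolution, full depth range.
full = cost_volume_bytes(512, 640, 192, 32)

# Hypothetical coarse stage: a quarter of the resolution per side, full depth range.
coarse = cost_volume_bytes(128, 160, 192, 32)

# Hypothetical fine stage: full resolution, few planes around the coarse estimate.
fine = cost_volume_bytes(512, 640, 8, 32)

if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"single-scale volume: {full / gib:.2f} GiB")
    print(f"coarse + fine volumes: {(coarse + fine) / gib:.2f} GiB")
    # Per-pixel depth from three hypothetical plane costs.
    print(soft_argmin_depth([1.0, 0.0, 1.0], [1.0, 2.0, 3.0]))
```

Under these assumed sizes the two smaller volumes together need far less memory than the single full-resolution volume, which is the general reason multi-scale designs can lower hardware demand for 3D-convolutional models.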
