Despite significant progress in the past few years, depth estimation from a single monocular image remains challenging. In particular, it is nontrivial to train a metric-depth prediction model that generalizes well to diverse scenes, mainly because metric training data are limited. Researchers have therefore built large-scale relative-depth datasets, which are much easier to collect. However, models trained on relative depth often fail to recover accurate 3D scene shape because of the unknown depth shift introduced by such training. We tackle this problem and estimate accurate scene shape by training on large-scale relative depth data while explicitly estimating the depth shift. To do so, we propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then exploits 3D point cloud data to predict the depth shift and the camera's focal length, which together allow us to recover the 3D scene shape. As the two modules are trained separately, we do not need strictly paired training data. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to improve training with relative depth annotations. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot evaluation. Code is available at: https://git.io/Depth
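To make the scale-and-shift ambiguity concrete, the sketch below is a minimal NumPy illustration, not the paper's method: it shows the standard closed-form least-squares alignment that recovers an unknown scale s and shift t when ground-truth depth is available. The helper name align_scale_shift and the synthetic data are our assumptions for illustration. The paper's point-cloud module is needed precisely because no ground truth exists at test time, so the shift must instead be predicted.

```python
# Illustrative sketch (not the paper's code): a depth prediction trained on
# relative-depth data matches metric depth only up to an affine transform,
# d_gt ~= s * d_pred + t, with unknown scale s and shift t.
import numpy as np

def align_scale_shift(d_pred: np.ndarray, d_gt: np.ndarray):
    """Solve min_{s,t} || s * d_pred + t - d_gt ||^2 over valid pixels."""
    valid = d_gt > 0                     # ignore pixels without ground truth
    x, y = d_pred[valid], d_gt[valid]
    # Normal equations of the 2x2 linear system in (s, t).
    A = np.array([[np.dot(x, x), x.sum()],
                  [x.sum(),      x.size]])
    b = np.array([np.dot(x, y), y.sum()])
    s, t = np.linalg.solve(A, b)
    return s, t

# Example: a synthetic prediction that is an affine transform of the truth.
d_gt = np.random.uniform(1.0, 10.0, size=(240, 320))
d_pred = 0.5 * d_gt - 0.2                # hidden scale 0.5 and shift -0.2
s, t = align_scale_shift(d_pred, d_gt)   # recovers s ~= 2.0, t ~= 0.4
d_metric = s * d_pred + t                # aligned (metric) depth
```

This alignment is common practice when evaluating affine-invariant depth predictions; the framework described above replaces the ground-truth-dependent shift recovery with a learned prediction from the reconstructed 3D point cloud.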