Existing monocular depth estimation methods have achieved excellent robustness in diverse scenes, but they can only retrieve affine-invariant depth, up to an unknown scale and shift. However, in some video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in the per-frame predictions may cause depth inconsistency. To solve this problem, we propose a locally weighted linear regression method that recovers the scale and shift from very sparse anchor points, which ensures scale consistency across consecutive frames. Extensive experiments show that our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks. Besides, we merge over 6.3 million RGBD images to train strong and robust depth models. Our ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, we formulate a new dense 3D scene reconstruction pipeline, which benefits from both the scale consistency of sparse points and the robustness of monocular methods. By performing simple per-frame prediction over a video, accurate 3D scene shape can be recovered.
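Below is a minimal sketch of how a locally weighted linear regression could recover a per-pixel scale and shift for an affine-invariant depth map from sparse anchor points. The Gaussian distance kernel, the bandwidth value, the regularization term, and the dense per-pixel loop are illustrative assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def lwlr_scale_shift(pred_depth, anchor_uv, anchor_depth, bandwidth=0.2):
    """Align an affine-invariant depth prediction to sparse metric anchors.

    pred_depth   : (H, W) affine-invariant depth prediction
    anchor_uv    : (N, 2) integer pixel coordinates (row, col) of anchors
    anchor_depth : (N,)   metric depth values at the anchors
    bandwidth    : Gaussian kernel bandwidth, as a fraction of the image diagonal
                   (hypothetical parameter for this sketch)
    """
    H, W = pred_depth.shape
    diag = np.hypot(H, W)

    # Predicted depth sampled at the anchor locations, with a bias column
    d_pred = pred_depth[anchor_uv[:, 0], anchor_uv[:, 1]]      # (N,)
    X = np.stack([d_pred, np.ones_like(d_pred)], axis=1)       # (N, 2)

    aligned = np.empty_like(pred_depth)
    for i in range(H):
        for j in range(W):
            # Distance-based weights: nearby anchors dominate the local fit
            dist = np.hypot(anchor_uv[:, 0] - i, anchor_uv[:, 1] - j) / diag
            w = np.exp(-0.5 * (dist / bandwidth) ** 2)

            # Weighted least squares for a local scale s and shift t
            A = X.T @ (w[:, None] * X) + 1e-6 * np.eye(2)
            b = X.T @ (w * anchor_depth)
            s, t = np.linalg.solve(A, b)
            aligned[i, j] = s * pred_depth[i, j] + t
    return aligned
```

Because the weights vary smoothly over the image, each pixel receives its own scale and shift dominated by nearby anchors, which is what keeps consecutive frames consistent when the anchors come from a shared sparse reconstruction.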