Existing monocular depth estimation methods have achieved excellent robustness in diverse scenes, but they can only retrieve affine-invariant depth, up to an unknown scale and shift. However, in some video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in per-frame predictions may cause depth inconsistency across frames. To solve this problem, we propose a locally weighted linear regression method that recovers the scale and shift from very sparse anchor points, ensuring scale consistency along consecutive frames. Extensive experiments show that our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks. In addition, we merge over 6.3 million RGBD images to train strong and robust depth models. Our ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, we formulate a new dense 3D scene reconstruction pipeline, which benefits from both the scale consistency of sparse points and the robustness of monocular methods. By simply performing per-frame prediction over a video, accurate 3D scene shape can be recovered.
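To illustrate the core idea of aligning an affine-invariant depth prediction to sparse anchor points, the following is a minimal sketch of locally weighted linear regression in NumPy. It is not the paper's implementation: the Gaussian kernel, the bandwidth parameter, and all function and variable names (e.g. `locally_weighted_scale_shift`, `anchor_uv`) are illustrative assumptions. At each query pixel, anchors are weighted by spatial proximity and a per-pixel scale and shift are obtained by weighted least squares.

```python
# Minimal sketch (assumed, not the authors' code): recover per-pixel scale and
# shift for an affine-invariant depth map from sparse metric anchor points via
# locally weighted linear regression.
import numpy as np

def locally_weighted_scale_shift(pred_depth, anchor_uv, anchor_depth, bandwidth=0.25):
    """Align an affine-invariant depth map to sparse metric anchors.

    pred_depth   : (H, W) affine-invariant depth prediction.
    anchor_uv    : (N, 2) integer anchor pixel coordinates (row, col).
    anchor_depth : (N,) metric depth values at the anchors.
    bandwidth    : Gaussian kernel bandwidth, as a fraction of the image diagonal.
    """
    H, W = pred_depth.shape
    diag = np.hypot(H, W)
    d_pred_at_anchor = pred_depth[anchor_uv[:, 0], anchor_uv[:, 1]]      # (N,)

    # Query every pixel; weight each anchor by spatial proximity.
    rows, cols = np.mgrid[0:H, 0:W]
    query = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)  # (HW, 2)
    dist = np.linalg.norm(query[:, None, :] - anchor_uv[None, :, :].astype(float), axis=2)
    w = np.exp(-0.5 * (dist / (bandwidth * diag)) ** 2)                   # (HW, N)

    # Weighted least squares for [scale, shift] at each query location:
    # minimize sum_i w_i * (scale * d_pred_i + shift - d_anchor_i)^2.
    x = d_pred_at_anchor[None, :]
    y = anchor_depth[None, :]
    sw   = w.sum(axis=1)
    swx  = (w * x).sum(axis=1)
    swy  = (w * y).sum(axis=1)
    swxx = (w * x * x).sum(axis=1)
    swxy = (w * x * y).sum(axis=1)
    det = sw * swxx - swx ** 2 + 1e-8
    scale = (sw * swxy - swx * swy) / det
    shift = (swxx * swy - swx * swxy) / det

    return scale.reshape(H, W) * pred_depth + shift.reshape(H, W)

if __name__ == "__main__":
    # Toy check: the true depth is 2 * prediction + 1, observed at 20 anchors.
    rng = np.random.default_rng(0)
    pred = rng.uniform(0.5, 5.0, size=(48, 64))
    gt = 2.0 * pred + 1.0
    uv = np.stack([rng.integers(0, 48, 20), rng.integers(0, 64, 20)], axis=1)
    aligned = locally_weighted_scale_shift(pred, uv, gt[uv[:, 0], uv[:, 1]])
    print("mean abs error:", np.abs(aligned - gt).mean())
```

In this simplified setting a single global scale and shift would already suffice; the local weighting matters when the affine ambiguity varies spatially or across frames, which is the consistency issue the abstract describes.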