Existing monocular depth estimation methods show excellent robustness in the wild, but their affine-invariant predictions must be aligned with the ground truth globally, using a single scale and shift, before they can be converted into metric depth. In this work, we first propose a modified locally weighted linear regression strategy that leverages sparse ground truth to generate a flexible, spatially varying depth transformation, correcting the coarse misalignment left by the global recovery strategy. Applying this strategy, we achieve significant improvements (more than 50% in the best case) over the most recent state-of-the-art methods on five zero-shot datasets. Moreover, we train a robust depth estimation model on 6.3 million samples and analyze the training process by decoupling the inaccuracy into a coarse-misalignment component and a missing-detail component. As a result, with the help of our recovery strategy, our ResNet50-based model even outperforms the state-of-the-art DPT ViT-Large model. In addition to accuracy, our strategy also improves temporal consistency when depth is estimated simply frame by frame on video. Compared with monocular depth estimation, robust video depth estimation, and depth completion methods, our pipeline achieves state-of-the-art performance on video depth estimation without any post-processing. We also conduct 3D scene reconstruction experiments from the consistent video depth for intuitive comparison.
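The difference between global recovery and locally weighted recovery can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the Gaussian kernel over pixel distance, the `bandwidth` and `ridge` parameters, and the function names `global_align` and `local_align` are all assumptions made for illustration. The global variant fits one scale and shift to the sparse ground-truth points; the local variant fits a separate scale and shift at every pixel, weighting each sparse point by its distance to that pixel.

```python
import numpy as np


def global_align(pred, anchor_rc, anchor_depth):
    """Fit ONE scale and shift by least squares on the sparse anchors."""
    x = pred[anchor_rc[:, 0], anchor_rc[:, 1]].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(
        A, anchor_depth.astype(np.float64), rcond=None
    )
    return scale * pred + shift


def local_align(pred, anchor_rc, anchor_depth, bandwidth=4.0, ridge=1e-6):
    """Fit a scale and shift PER PIXEL by distance-weighted least squares.

    anchor_rc:    (N, 2) integer (row, col) positions of sparse ground truth
    anchor_depth: (N,) metric depth at those positions
    bandwidth:    Gaussian kernel width in pixels (assumed kernel choice)
    ridge:        small regulariser keeping the 2x2 systems invertible
    """
    h, w = pred.shape
    rows, cols = np.mgrid[0:h, 0:w]
    pix = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float64)

    x = pred[anchor_rc[:, 0], anchor_rc[:, 1]].astype(np.float64)  # (N,)
    y = anchor_depth.astype(np.float64)                            # (N,)

    # Gaussian weight of every anchor for every pixel: shape (P, N).
    d2 = ((pix[:, None, :] - anchor_rc[None, :, :]) ** 2).sum(axis=-1)
    wgt = np.exp(-d2 / (2.0 * bandwidth ** 2))

    # Weighted normal equations for y ~ scale * x + shift, one per pixel.
    s_w, s_x, s_y = wgt.sum(axis=1), wgt @ x, wgt @ y
    s_xx, s_xy = wgt @ (x * x), wgt @ (x * y)

    a11, a12, a22 = s_xx + ridge, s_x, s_w + ridge
    det = a11 * a22 - a12 * a12
    scale = (a22 * s_xy - a12 * s_y) / det
    shift = (a11 * s_y - a12 * s_xy) / det
    return scale.reshape(h, w) * pred + shift.reshape(h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.uniform(1.0, 2.0, size=(32, 32))
    # Synthetic "metric" depth whose scale drifts across the image.
    gt = (1.0 + np.arange(32) / 31.0)[None, :] * pred + 0.2
    rr, cc = np.meshgrid(np.arange(2, 32, 4), np.arange(2, 32, 4), indexing="ij")
    anchors = np.stack([rr.ravel(), cc.ravel()], axis=1)
    sparse = gt[anchors[:, 0], anchors[:, 1]]
    print("global MAE:", np.abs(global_align(pred, anchors, sparse) - gt).mean())
    print("local  MAE:", np.abs(local_align(pred, anchors, sparse) - gt).mean())
```

The sketch builds a dense pixel-by-anchor weight matrix, which is only practical for small images or few anchors; a real pipeline would need a more memory-efficient formulation. When the true relation is a single affine map, both functions recover it; they differ only when the scale or shift drifts across the image.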