Most self-supervised monocular depth estimation methods focus on driving scenarios. We show that such methods generalize poorly to unseen, complex indoor scenes, where objects are cluttered and arbitrarily arranged in the near field. To improve robustness, we propose a structure distillation approach that learns structural knowledge from a pretrained depth estimator, which produces structured but metric-agnostic depth because it was trained on mixed in-the-wild datasets. By combining distillation with a self-supervised branch that learns metric scale from left-right consistency, we obtain structured, metric depth for generic indoor scenes and run inference in real time. To facilitate learning and evaluation, we collect SimSIN, a simulated dataset spanning thousands of environments, and UniSIN, a dataset of about 500 real scan sequences of generic indoor environments. We experiment in both sim-to-real and real-to-real settings and show qualitative and quantitative improvements, as well as benefits in downstream applications that use our depth maps. This work provides a full study covering methods, data, and applications. We believe it lays a solid basis for practical indoor depth estimation via self-supervision.
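As an illustration of how a metric-agnostic teacher can be combined with a metric self-supervised signal, the following is a minimal sketch, not the formulation used in this work. It assumes that structure is compared after removing a per-image shift (median) and scale (mean absolute deviation), a normalization borrowed from scale-and-shift-invariant losses, and that the left-right consistency term is a plain L1 photometric error on an already-warped right image. All function names and the weight `alpha` are hypothetical.

```python
import numpy as np

def normalize_depth(d, eps=1e-8):
    """Remove per-image shift (median) and scale (mean absolute deviation)."""
    shift = np.median(d)
    scale = np.mean(np.abs(d - shift))
    return (d - shift) / (scale + eps)

def structure_distillation_loss(student, teacher):
    """L1 distance between normalized maps; ignores the teacher's unknown scale and shift."""
    return float(np.mean(np.abs(normalize_depth(student) - normalize_depth(teacher))))

def photometric_loss(left, right_warped):
    """Assumed stand-in for left-right consistency: L1 error against the warped right view."""
    return float(np.mean(np.abs(left - right_warped)))

def total_loss(left, right_warped, student, teacher, alpha=0.1):
    """Hypothetical combination: metric cue from stereo plus weighted structure cue."""
    return photometric_loss(left, right_warped) + alpha * structure_distillation_loss(student, teacher)
```

In this sketch the distillation term is unaffected by any positive rescaling or offset of the teacher output, so the metric scale is left entirely to the photometric term.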