We propose a method to infer a dense depth map from a single image, its calibration, and the associated sparse point cloud. In order to leverage existing models (teachers) that produce putative depth maps, we propose an adaptive knowledge distillation approach that yields a positive congruent training process, wherein a student model avoids learning the error modes of the teachers. In the absence of ground truth for model selection and training, our method, termed Monitored Distillation, allows a student to exploit a blind ensemble of teachers by selectively learning from predictions that best minimize the reconstruction error for a given image. Monitored Distillation yields a distilled depth map and a confidence map, or ``monitor'', for how well a prediction from a particular teacher fits the observed image. The monitor adaptively weights the distilled depth so that, if all of the teachers exhibit high residuals, the standard unsupervised image reconstruction loss takes over as the supervisory signal. On indoor scenes (VOID), we outperform blind ensembling baselines by 17.53% and unsupervised methods by 24.25%; we boast a 79% model size reduction while maintaining comparable performance to the best supervised method. For outdoors (KITTI), we tie for 5th overall on the benchmark despite not using ground truth. Code available at: https://github.com/alexklwong/mondi-python.
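To make the weighting scheme concrete, the following is a minimal PyTorch-style sketch of a monitored-distillation loss, assuming per-teacher depth predictions and their image reconstruction residuals have already been computed. The function name, the exponential confidence, and the `sigma` parameter are illustrative assumptions, not the authors' implementation; see the repository above for the official code.

```python
# Hypothetical sketch of a monitored-distillation style loss (not the official
# implementation; see https://github.com/alexklwong/mondi-python).
import torch

def monitored_distillation_loss(student_depth,       # (B, 1, H, W) student prediction
                                teacher_depths,       # (B, T, H, W) blind-ensemble predictions
                                teacher_residuals,    # (B, T, H, W) per-teacher image reconstruction error
                                student_recon_error,  # (B, 1, H, W) unsupervised reconstruction error of the student
                                sigma=0.1):           # assumed temperature for the confidence map
    # Per pixel, pick the teacher whose prediction best reconstructs the image.
    best_residual, best_idx = teacher_residuals.min(dim=1, keepdim=True)   # (B, 1, H, W)
    distilled_depth = torch.gather(teacher_depths, dim=1, index=best_idx)  # (B, 1, H, W)

    # "Monitor": confidence that the best teacher fits the observed image.
    # Low residual -> confidence near 1; high residual -> confidence near 0.
    monitor = torch.exp(-best_residual / sigma)

    # Distillation term, weighted by the monitor ...
    distillation = monitor * torch.abs(student_depth - distilled_depth)
    # ... while the standard unsupervised reconstruction loss takes over
    # wherever all teachers exhibit high residuals.
    unsupervised = (1.0 - monitor) * student_recon_error

    return (distillation + unsupervised).mean()
```

The intended design point is that the same residual that selects the best teacher also gates the loss: wherever every teacher reconstructs the image poorly, the monitor tends toward zero and the student falls back to the standard unsupervised reconstruction objective.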