In recent years, many methods have demonstrated the ability of neural networks to learn depth and pose changes in a sequence of images, using only self-supervision as the training signal. While these networks achieve good performance, an often overlooked detail is that, due to the inherent ambiguity of monocular vision, they predict depth only up to an unknown scaling factor. The scaling factor is then typically obtained from the LiDAR ground truth at test time, which severely limits the practical applications of these methods. In this paper, we show that by incorporating prior information about the camera configuration and the environment, we can remove the scale ambiguity and predict depth directly, still using the self-supervised formulation and without relying on any additional sensors.