Dense depth estimation is essential to scene-understanding for autonomous driving. However, recent self-supervised approaches on monocular videos suffer from scale-inconsistency across long sequences. Utilizing data from the ubiquitously copresent global positioning systems (GPS), we tackle this challenge by proposing a dynamically-weighted GPS-to-Scale (g2s) loss to complement the appearance-based losses. We emphasize that the GPS is needed only during the multimodal training, and not at inference. The relative distance between frames captured through the GPS provides a scale signal that is independent of the camera setup and scene distribution, resulting in richer learned feature representations. Through extensive evaluation on multiple datasets, we demonstrate scale-consistent and -aware depth estimation during inference, improving the performance even when training with low-frequency GPS data.
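To make the idea concrete, below is a minimal sketch of what a GPS-to-Scale style loss term could look like, assuming the loss penalizes the gap between the norm of the predicted inter-frame camera translation and the metric distance between consecutive GPS fixes. The function name, tensor shapes, and the fixed `weight` hyperparameter are illustrative assumptions; the paper's actual formulation weights this term dynamically during training.

```python
import torch

def g2s_scale_loss(pred_translation, gps_positions, weight=1.0):
    """Hedged sketch of a GPS-to-Scale style loss (not the authors' exact
    formulation): encourage the predicted, scale-ambiguous translation to
    match the metric distance travelled according to GPS.

    pred_translation: (B, 3) translation predicted by the pose network
                      between frame t and frame t+1.
    gps_positions:    (B, 2, 3) GPS-derived positions for frames t and t+1,
                      assumed converted to a local metric frame (e.g. ENU).
    weight:           scalar weight; a stand-in for the dynamic weighting
                      used in the paper.
    """
    # Metric distance between the two frames according to GPS.
    gps_distance = torch.norm(gps_positions[:, 1] - gps_positions[:, 0], dim=-1)
    # Magnitude of the predicted (up-to-scale) camera translation.
    pred_distance = torch.norm(pred_translation, dim=-1)
    # L1 penalty pushes the predicted translation toward metric scale.
    return weight * torch.abs(pred_distance - gps_distance).mean()
```

In a self-supervised pipeline of this kind, such a term would simply be added to the usual appearance-based objectives (photometric reprojection and smoothness losses), and is only evaluated during training, since no GPS input is required at inference.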