Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training. However, for many applications, sequence information in the form of video frames is also available at test time. The vast majority of monocular networks do not make use of this extra signal, thus ignoring valuable information that could be used to improve the predicted depth. Those that do, either use computationally expensive test-time refinement techniques or off-the-shelf recurrent networks, which only indirectly make use of the geometric information that is inherently available. We propose ManyDepth, an adaptive approach to dense depth estimation that can make use of sequence information at test time, when it is available. Taking inspiration from multi-view stereo, we propose a deep end-to-end cost volume based approach that is trained using self-supervision only. We present a novel consistency loss that encourages the network to ignore the cost volume when it is deemed unreliable, e.g. in the case of moving objects, and an augmentation scheme to cope with static cameras. Our detailed experiments on both KITTI and Cityscapes show that we outperform all published self-supervised baselines, including those that use single or multiple frames at test time.
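To make the multi-view-stereo inspiration concrete, below is a minimal, illustrative sketch of how a plane-sweep cost volume between a reference frame and one source frame could be assembled. This is not the authors' released implementation: the function name, tensor shapes, depth-bin handling, and the L1 matching cost are all assumptions made purely for illustration.

```python
# Illustrative sketch (assumed, not ManyDepth's official code) of a
# plane-sweep cost volume built from encoder features of two frames.
import torch
import torch.nn.functional as F


def plane_sweep_cost_volume(ref_feat, src_feat, K, K_inv, T_src_ref, depth_bins):
    """ref_feat, src_feat: [B, C, H, W] encoder features.
    K, K_inv: [B, 3, 3] camera intrinsics and their inverse.
    T_src_ref: [B, 4, 4] relative pose mapping reference-camera points
    into the source camera. depth_bins: [D] candidate depths.
    Returns an assumed cost volume of shape [B, D, H, W]."""
    B, C, H, W = ref_feat.shape
    device = ref_feat.device

    # Pixel grid in homogeneous coordinates, shape [B, 3, H*W].
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Camera rays for the reference view: K^{-1} applied to pixels.
    rays = K_inv @ pix  # [B, 3, H*W]

    costs = []
    for d in depth_bins:
        # Back-project reference pixels to 3D at the hypothesised depth d.
        pts = rays * d  # [B, 3, H*W]
        pts_h = torch.cat([pts, torch.ones(B, 1, H * W, device=device)], dim=1)

        # Transform into the source camera and project with K.
        pts_src = (T_src_ref @ pts_h)[:, :3]           # [B, 3, H*W]
        proj = K @ pts_src                             # [B, 3, H*W]
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

        # Normalise to [-1, 1] and warp source features with grid_sample.
        u = 2.0 * uv[:, 0] / (W - 1) - 1.0
        v = 2.0 * uv[:, 1] / (H - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
        warped = F.grid_sample(src_feat, grid, padding_mode="zeros",
                               align_corners=True)

        # Assumed L1 matching cost between reference and warped features.
        costs.append((ref_feat - warped).abs().mean(dim=1))  # [B, H, W]

    return torch.stack(costs, dim=1)  # [B, D, H, W]
```

In such a sketch, the resulting [B, D, H, W] volume would be fed to a depth decoder, and a consistency loss of the kind described above could down-weight regions (e.g. moving objects) where the matching costs are unreliable.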