Self-supervised depth estimation for indoor environments is more challenging than its outdoor counterpart in at least the following two aspects: (i) the depth range of indoor sequences varies considerably across frames, making it difficult for the depth network to induce consistent depth cues, whereas the maximum distance in outdoor scenes mostly stays the same because the camera usually sees the sky; (ii) indoor sequences contain many more rotational motions, which cause difficulties for the pose network, while the motions of outdoor sequences are predominantly translational, especially for driving datasets such as KITTI. In this paper, special consideration is given to these challenges, and a set of good practices is consolidated for improving the performance of self-supervised monocular depth estimation in indoor environments. The proposed method mainly consists of two novel modules, \ie, a depth factorization module and a residual pose estimation module, designed to tackle the first and second challenge, respectively. The effectiveness of each module is shown through a carefully conducted ablation study and the demonstration of state-of-the-art performance on two indoor datasets, \ie, EuRoC and NYUv2.