We present an end-to-end joint training framework that explicitly models the 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection when modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals on every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps, which are used as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods. Our code, dataset, and models are available at https://github.com/SeokjuLee/Insta-DM.
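To make the inverse-vs-forward projection distinction concrete, below is a minimal sketch (not the authors' code) of the standard inverse (backward) warping used in self-supervised monocular depth pipelines: target pixels are back-projected with the predicted depth, transformed by a rigid motion, and the source image is sampled at the projected locations. All names (`backproject`, `project`, `inverse_warp`, `T_tgt2src`) are illustrative assumptions, not the Insta-DM API. The paper's forward projection module instead warps source-frame geometry forward under each object's estimated motion, which a plain `grid_sample` of this kind cannot express.

```python
# Minimal sketch of inverse (backward) projection for view synthesis,
# assuming a pinhole camera with intrinsics K. Illustrative only.
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each pixel (u, v) to a 3-D point using its predicted depth.
    depth: (B,1,H,W), K_inv: (B,3,3). Returns (B,3,H*W)."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().view(1, 3, -1)
    rays = K_inv @ pix.to(depth.device)          # per-pixel viewing rays
    return rays * depth.view(B, 1, -1)           # scale rays by depth

def project(points, K, T):
    """Apply a rigid motion T (B,4,4) and project with K (B,3,3).
    Returns pixel coordinates (B,2,N)."""
    B, _, N = points.shape
    hom = torch.cat([points, points.new_ones(B, 1, N)], dim=1)
    cam = (T @ hom)[:, :3]                       # points in the source frame
    pix = K @ cam
    return pix[:, :2] / pix[:, 2:].clamp(min=1e-6)

def inverse_warp(src_img, tgt_depth, K, K_inv, T_tgt2src):
    """Inverse projection: sample the *source* image at locations obtained by
    projecting *target* pixels into the source view. The warp is driven by
    target-frame geometry, so per-object motion defined in the source frame
    is not handled correctly -- the motivation for a forward projection module."""
    B, _, H, W = src_img.shape
    coords = project(backproject(tgt_depth, K_inv), K, T_tgt2src)
    u = 2.0 * coords[:, 0] / (W - 1) - 1.0       # normalize to [-1, 1]
    v = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)
```

The warped source image returned by `inverse_warp` can be compared against the target image with a photometric loss; an instance-aware variant would apply such consistency terms separately per object mask, as described in the abstract.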