Recently, much attention has been drawn to learning the underlying 3D structure of a scene from monocular videos in a fully self-supervised fashion. One of the most challenging aspects of this task is handling independently moving objects, as they break the rigid-scene assumption. For the first time, we show that pixel positional information can be exploited to learn single-view depth estimation (SVDE) from videos. Our proposed moving object (MO) masks, which are induced by shifted positional information (SPI) and referred to as `SPIMO' masks, are very robust and consistently remove the independently moving objects in the scenes, allowing for better learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for our depth discretization. Finally, we employ existing boosting techniques in a new way to further self-supervise the depth of the moving objects. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with almost 8.5x fewer parameters than previous works that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.