We introduce a method for learning to estimate a scene representation from a single image by predicting a low-dimensional subspace of optical flow for each training example, a subspace that encompasses the variety of possible camera and object movement. Supervision is provided by a novel loss which measures the distance between this predicted flow subspace and an observed optical flow. This provides a new approach to learning scene representation tasks, such as monocular depth prediction or instance segmentation, in an unsupervised fashion using in-the-wild input videos without requiring camera poses, intrinsics, or an explicit multi-view stereo step. We evaluate our method in multiple settings, including an indoor depth prediction task where it achieves comparable performance to recent methods trained with more supervision.
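The core supervisory signal can be sketched as a subspace distance: given a predicted basis of k flow fields and an observed optical flow, project the observed flow onto the span of the basis and penalize the residual. The following is a minimal illustrative sketch, not the paper's actual implementation; the function name, shapes, and least-squares projection are assumptions.

```python
import numpy as np

def flow_subspace_loss(B: np.ndarray, f: np.ndarray) -> float:
    """Distance between an observed flow and a predicted flow subspace.

    B: (H*W*2, k) predicted basis, each column a flattened flow field.
    f: (H*W*2,)   observed optical flow, flattened.
    """
    # Best linear combination of the basis flows approximating f.
    coeffs, *_ = np.linalg.lstsq(B, f, rcond=None)
    # Component of f lying outside the predicted subspace.
    residual = f - B @ coeffs
    # Mean squared residual as the subspace-to-flow distance.
    return float(np.mean(residual ** 2))

# Toy check: a flow inside the subspace incurs (near) zero loss,
# while an arbitrary flow generally does not.
rng = np.random.default_rng(0)
B = rng.normal(size=(128, 3))
f_in = B @ np.array([0.5, -1.0, 2.0])
f_out = rng.normal(size=128)
print(flow_subspace_loss(B, f_in) < 1e-10)  # True
print(flow_subspace_loss(B, f_out) > flow_subspace_loss(B, f_in))  # True
```

Because the loss depends only on the span of the basis, the network can be supervised without knowing which particular camera or object motion produced the observed flow.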