We propose a method to train deep networks to decompose videos into 3D geometry (camera and depth), moving objects, and their motions, with no supervision. We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point-of-view, specified by a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits the predictive power, and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image instead. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. Our network can then predict different poses for each region, in a sliding window. This represents a significantly richer model, including 6D object motions, with little additional complexity. We establish new state-of-the-art results on unsupervised odometry and depth prediction on KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging dataset of indoor videos, where there is no ground truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.
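For concreteness, the view-synthesis objective described above can be sketched as follows; the notation (intrinsics $K$, predicted depth $\hat{D}_t$, predicted relative pose $\hat{T}_{t \to s}$, and region index $r$) is our own shorthand for the standard formulation, not taken verbatim from the abstract. A pixel $p_t$ in the target frame is reprojected into the source frame using the predicted depth and pose,
\[
p_s \;\sim\; K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} \, p_t ,
\]
and the warped source image $\hat{I}_s$ (obtained by sampling $I_s$ at the coordinates $p_s$) is compared photometrically to the target image,
\[
\mathcal{L}_{\mathrm{vs}} \;=\; \sum_{p} \bigl| I_t(p) - \hat{I}_s(p) \bigr| .
\]
This loss is only valid where the scene is rigid. The region-based variant suggested above would instead predict a separate pose $\hat{T}^{(r)}_{t \to s}$ for each small window $r$ and accumulate the error locally,
\[
\mathcal{L} \;=\; \sum_{r} \sum_{p \in r} \bigl| I_t(p) - \hat{I}^{(r)}_s(p) \bigr| ,
\]
so that a window lying inside a moving object can be explained by that object's own 6D motion rather than the camera's.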