We propose a method to train deep networks to decompose videos into 3D geometry (camera and depth), moving objects, and their motions, with no supervision. We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point of view, specified by a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits the predictive power, and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image instead. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. Our network can then predict a different pose for each region, in a sliding window over a learned dense pose map. This represents a significantly richer model, including 6D object motions, with little additional complexity. We achieve very competitive performance on unsupervised odometry and depth prediction on KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging dataset of indoor videos, where there is no ground-truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.
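The view-synthesis step the abstract builds on can be sketched in a few lines: back-project target pixels using the predicted depth, transform them by the predicted relative pose, and re-project into the source image to sample colours. This is a minimal, non-differentiable illustration under assumed conventions (pinhole intrinsics `K`, a 4x4 rigid transform, nearest-neighbour sampling); function and variable names are illustrative, not the paper's actual implementation, and a trainable pipeline would use differentiable bilinear sampling instead.

```python
import numpy as np

def view_synthesis(src_img, depth_tgt, K, T_tgt_to_src):
    """Warp a source image into the target view, given the target
    depth map and a relative camera pose (rigid-world assumption).
    Illustrative sketch, not the paper's implementation."""
    H, W = depth_tgt.shape
    # Target pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    # Back-project each target pixel to a 3D point in camera coordinates.
    cam = (np.linalg.inv(K) @ pix) * depth_tgt.ravel()
    # Move the points into the source camera frame.
    cam_h = np.vstack([cam, np.ones((1, H * W))])
    cam_src = (T_tgt_to_src @ cam_h)[:3]
    # Project into the source image plane.
    proj = K @ cam_src
    us = np.clip((proj[0] / proj[2]).round().astype(int), 0, W - 1)
    vs = np.clip((proj[1] / proj[2]).round().astype(int), 0, H - 1)
    # Nearest-neighbour sampling; a differentiable version would
    # use bilinear interpolation and mask out-of-bounds pixels.
    return src_img[vs, us].reshape(H, W)
```

With the identity pose the warp is a no-op, which gives a quick sanity check: the synthesized image equals the source image, so the photometric error that trains the network is zero. Minimizing that error over many frames is what supervises the pose and depth predictions; the paper's regional variant applies the same equations per patch, each with its own pose.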