Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, they can both be seen as ways of observing the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point of time. These two tasks provide complementary signals for obtaining a scene representation: viewpoint changes across spatial observations inform scene depth, while temporal observations inform the motion of cameras and individual objects. Motivated by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages self-supervision and the complementary cues from both tasks, whereas existing methods can only solve one of them. Experiments show that our method achieves performance better than or comparable to several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets.