Video prediction is an extrapolation task that predicts future frames from past frames, whereas video frame interpolation is an interpolation task that estimates intermediate frames between two given frames. Video frame interpolation has advanced tremendously, but general video prediction in the wild remains an open problem. Inspired by the photo-realistic results of video frame interpolation, we present a new optimization framework for video prediction via video frame interpolation, in which we solve an extrapolation problem using an interpolation model. Our framework optimizes through a pretrained, differentiable video frame interpolation module and needs no training dataset, so there is no domain gap between training and test data. Our approach also needs no additional information such as semantic or instance maps, which makes the framework applicable to any video. Extensive experiments on the Cityscapes, KITTI, DAVIS, Middlebury, and Vimeo90K datasets show that our video prediction results are robust in general scenarios, and that our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
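The idea of solving extrapolation with an interpolation model can be illustrated with a toy sketch. The abstract does not state the exact objective, so the following is only one plausible reading: treat the unknown future frame as the optimization variable and adjust it until interpolating between the previous frame and the candidate future frame reproduces the current, observed frame. The function `interpolate` here is a hypothetical stand-in (a pixel-wise average), not the pretrained interpolation network, and `predict_next`, `steps`, and `lr` are names chosen purely for illustration.

```python
# Toy sketch: predict a future frame by optimizing through a fixed interpolator.
# ASSUMPTION: `interpolate` is a pixel-wise average used as a stand-in; the
# paper uses a pretrained differentiable interpolation network instead.

def interpolate(a, b):
    """Hypothetical differentiable interpolator: midpoint of two frames."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def predict_next(prev_frame, curr_frame, steps=500, lr=0.5):
    """Find a future frame whose interpolation with prev_frame matches curr_frame."""
    future = list(curr_frame)  # initialize with the latest observed frame
    for _ in range(steps):
        mid = interpolate(prev_frame, future)
        # Loss = sum((mid_i - curr_i)^2). For this toy interpolator,
        # d(mid_i)/d(future_i) = 1/2, so d(loss)/d(future_i) = mid_i - curr_i.
        grad = [m - c for m, c in zip(mid, curr_frame)]
        future = [f - lr * g for f, g in zip(future, grad)]
    return future


prev_frame = [0.0, 1.0, 2.0]  # flattened toy "frame" at time t-1
curr_frame = [1.0, 1.5, 2.0]  # flattened toy "frame" at time t
predicted = predict_next(prev_frame, curr_frame)
```

With the averaging stand-in, the optimum is simply the linear extrapolation `2 * curr_frame - prev_frame`. A learned interpolator would replace `interpolate`, and its gradient would come from automatic differentiation rather than the hand-derived expression used here.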