This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to predicting future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones and warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on the Cityscapes (resp. KITTI) dataset show that WALDO significantly outperforms prior works with, e.g., 3, 27, and 51% (resp. 5, 20, and 11%) relative improvement in SSIM, LPIPS, and FVD metrics. Code, pretrained models, and video samples synthesized by our approach can be found on the project webpage https://16lemoing.github.io/waldo.
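To make the layer-and-warp idea concrete, here is a minimal sketch (not the authors' implementation) of the core synthesis step the abstract describes: each layer's object region is warped by its own parametric geometric transformation, and the warped layers are composited back into a frame. For simplicity the example uses a plain affine warp with nearest-neighbor sampling and back-to-front compositing; WALDO's actual transformations are driven by predicted control points and the remaining regions are inpainted.

```python
import numpy as np

def warp_layer(image, mask, affine):
    """Inverse-warp a single layer (image + object mask) by a 2x3 affine matrix."""
    h, w = mask.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous output-pixel coordinates, shape (3, h*w).
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    # Map each output pixel back to its source location (inverse mapping).
    inv = np.linalg.inv(np.vstack([affine, [0.0, 0.0, 1.0]]))[:2]
    src = inv @ coords
    sx = np.clip(np.round(src[0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(src[1]).astype(int), 0, h - 1)
    return image[sy, sx].reshape(h, w), mask[sy, sx].reshape(h, w)

def composite(layers):
    """Composite (image, mask) layers back-to-front into one frame."""
    out = np.zeros_like(layers[0][0])
    for img, m in layers:
        out = np.where(m > 0.5, img, out)
    return out
```

In the full pipeline, one such transformation would be predicted per layer for each upcoming frame, and the pixels left uncovered after compositing would be filled in by an inpainting module.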