We present a frame interpolation algorithm that synthesizes multiple intermediate frames from two input images with large in-between motion. Recent methods use multiple networks to estimate optical flow or depth and a separate network dedicated to frame synthesis. This is often complex and requires scarce ground-truth optical flow or depth. In this work, we present a single unified network, distinguished by a multi-scale feature extractor that shares weights across all scales, and is trainable from frames alone. To synthesize crisp and pleasing frames, we propose to optimize our network with the Gram matrix loss, which measures the correlation difference between feature maps. Our approach outperforms state-of-the-art methods on the Xiph large motion benchmark. We also achieve higher scores on Vimeo-90K, Middlebury, and UCF101 when compared to methods that use perceptual losses. We study the effect of weight sharing and of training with datasets of increasing motion range. Finally, we demonstrate our model's effectiveness in synthesizing high-quality and temporally coherent videos on a challenging near-duplicate photos dataset. Code and pre-trained models are available at https://github.com/google-research/frame-interpolation.
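To make the Gram matrix loss concrete, the sketch below shows one common formulation: the loss compares channel-wise correlation (Gram) matrices of feature maps, typically taken from a pretrained network such as VGG. This is a minimal illustration under those assumptions; the paper's exact feature levels, normalization, and loss weighting may differ.

```python
import numpy as np

def gram_matrix(feats):
    """Channel-wise correlation (Gram) matrix of a feature map.

    feats: array of shape (H, W, C), e.g. one level of a pretrained
    feature extractor. Returns a (C, C) matrix normalized by the
    number of spatial positions.
    """
    h, w, c = feats.shape
    f = feats.reshape(h * w, c)
    return f.T @ f / (h * w)

def gram_loss(pred_feats, target_feats):
    """Squared L2 difference between Gram matrices, summed over levels.

    pred_feats / target_feats: lists of (H, W, C) feature maps for the
    synthesized and ground-truth frames, one entry per feature level.
    """
    return sum(
        np.sum((gram_matrix(p) - gram_matrix(t)) ** 2)
        for p, t in zip(pred_feats, target_feats)
    )
```

Because the Gram matrix discards spatial layout and keeps only feature co-occurrence statistics, matching it encourages sharp, texture-consistent synthesis rather than the blur that per-pixel losses tend to produce.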