Video frame interpolation synthesizes non-existent images between adjacent frames, with the aim of providing a smooth and consistent visual experience. Two approaches to this challenging task are optical-flow-based and kernel-based methods. In existing works, optical-flow-based methods can provide an accurate point-to-point motion description; however, they lack constraints on object structure. In contrast, kernel-based methods focus on structural alignment, which relies on semantic and appearance features, but they tend to produce blurry results. Based on these observations, we propose a structure-motion based iterative fusion method. The framework is an end-to-end learnable structure with two stages. First, interpolated frames are synthesized by structure-based and motion-based learning branches, respectively; then, an iterative refinement module is established via spatial and temporal feature integration. Inspired by the observation that audiences have different visual preferences for foreground and background objects, we are the first to propose using saliency masks in the evaluation process of video frame interpolation. Experimental results on three typical benchmarks show that the proposed method achieves superior performance on all evaluation metrics over state-of-the-art methods, even when our models are trained with only one-tenth of the data that other methods use.
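To make the two-stage design concrete, the following is a minimal PyTorch sketch of the overall data flow: two candidate frames are synthesized by separate branches, fused, and then refined iteratively. All module names, layer choices, and the residual-refinement loop are hypothetical stand-ins; the paper's actual branch architectures (kernel prediction, flow estimation) and refinement module are not specified here.

```python
# Hypothetical sketch of the two-stage structure-motion fusion pipeline.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))


class TwoStageInterpolator(nn.Module):
    def __init__(self, steps=3):
        super().__init__()
        self.steps = steps
        # Stage 1: two candidate synthesizers (stand-ins for the
        # structure-based and motion-based branches).
        self.structure_branch = nn.Sequential(conv_block(6, 32), nn.Conv2d(32, 3, 3, padding=1))
        self.motion_branch = nn.Sequential(conv_block(6, 32), nn.Conv2d(32, 3, 3, padding=1))
        # Stage 2: an iterative refinement module fusing the two candidates
        # with spatial-temporal context from the input frames.
        self.refine = nn.Sequential(conv_block(12, 32), nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, frame0, frame1):
        x = torch.cat([frame0, frame1], dim=1)
        cand_s = self.structure_branch(x)   # structure-aligned candidate
        cand_m = self.motion_branch(x)      # motion-compensated candidate
        out = 0.5 * (cand_s + cand_m)       # initial fusion of the two branches
        for _ in range(self.steps):         # iterative residual refinement
            ctx = torch.cat([out, cand_s, cand_m, frame0 - frame1], dim=1)
            out = out + self.refine(ctx)
        return out


if __name__ == "__main__":
    f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    mid = TwoStageInterpolator()(f0, f1)
    print(mid.shape)  # torch.Size([1, 3, 64, 64])
```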
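The saliency-masked evaluation can likewise be illustrated with a short sketch, assuming a precomputed binary saliency mask and PSNR as the underlying metric; the paper's actual mask source and metric weighting are not given here. The idea is simply to score foreground (salient) and background pixels separately, reflecting their different perceptual importance.

```python
# Hypothetical sketch: PSNR evaluated separately on foreground/background.
import numpy as np


def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR restricted to pixels where mask == 1."""
    mask = mask.astype(bool)
    if mask.sum() == 0:
        return float("inf")
    mse = np.mean((pred[mask] - gt[mask]) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    h, w = 64, 64
    pred = np.random.rand(h, w, 3)
    gt = np.clip(pred + 0.01 * np.random.randn(h, w, 3), 0, 1)
    saliency = np.zeros((h, w, 3), dtype=np.uint8)
    saliency[16:48, 16:48, :] = 1               # hypothetical salient region
    fg = masked_psnr(pred, gt, saliency)        # foreground (salient) PSNR
    bg = masked_psnr(pred, gt, 1 - saliency)    # background PSNR
    print(f"foreground PSNR: {fg:.2f} dB, background PSNR: {bg:.2f} dB")
```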