The problem of video frame interpolation is to increase the temporal resolution of a low frame-rate video by interpolating novel frames between existing, temporally sparse frames. This paper presents a self-supervised approach to video frame interpolation that requires only a single video. We pose the video as a set of layers. Each layer is parameterized by two implicit neural networks -- one for learning a static frame and the other for a time-varying motion field corresponding to the video dynamics. Together they represent an occlusion-free subset of the scene, augmented with a pseudo-depth channel. To model inter-layer occlusions, all layers are lifted to 2.5D space so that frontal layers occlude distant ones. This is done by assigning each layer a depth channel, which we call `pseudo-depth', whose partial order defines the occlusion relations between layers. The pseudo-depths are converted to visibility values through a fully differentiable SoftMin function, so that closer layers are more visible than more distant ones. Meanwhile, we parameterize the video motion by solving an ordinary differential equation (ODE) defined on a time-varying neural velocity field, which guarantees valid motions. This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution. We demonstrate the effectiveness of our method on real-world datasets, where it achieves performance comparable to state-of-the-art methods that require ground-truth labels for training.
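To make the two core mechanisms concrete, the sketch below illustrates (1) SoftMin visibility compositing over per-layer pseudo-depths and (2) motion obtained by integrating a neural velocity field with a fixed-step Euler solver. This is a minimal PyTorch illustration, not the authors' implementation: the temperature `tau`, the network sizes, and the names `composite_layers`, `VelocityField`, and `warp_coords` are all assumptions made for exposition.

```python
# Minimal sketch (assumed, illustrative): softmin compositing by pseudo-depth
# and Euler integration of a neural velocity field for layer motion.
import torch
import torch.nn as nn

def composite_layers(colors, pseudo_depths, tau=0.1):
    """Blend per-layer colors with softmin visibility over pseudo-depth.

    colors:        (L, 3, H, W)  RGB predicted by each layer's frame network
    pseudo_depths: (L, 1, H, W)  smaller value = closer = more visible
    """
    # Softmin over the layer axis: softmax of negated depths, so closer
    # (smaller-depth) layers receive higher, fully differentiable visibility.
    visibility = torch.softmax(-pseudo_depths / tau, dim=0)  # (L, 1, H, W)
    return (visibility * colors).sum(dim=0)                  # (3, H, W)

class VelocityField(nn.Module):
    """Tiny MLP v(x, t) mapping a space-time point to a 2D velocity."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x, t):
        # x: (N, 2) coordinates; t: scalar time broadcast to every point.
        t = torch.full_like(x[:, :1], t)
        return self.net(torch.cat([x, t], dim=-1))

def warp_coords(field, x0, t0, t1, steps=16):
    """Integrate dx/dt = v(x, t) from t0 to t1 with forward Euler."""
    x, dt = x0, (t1 - t0) / steps
    for k in range(steps):
        x = x + dt * field(x, t0 + k * dt)
    return x
```

Because the trajectories come from integrating a velocity field rather than from per-frame flow vectors, querying `warp_coords` at any intermediate `t1` yields a consistent motion, which is what permits interpolation at arbitrary temporal resolution; a higher-order solver (e.g. RK4) could replace the Euler steps for accuracy.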