Most deep learning methods for video frame interpolation consist of three main components: feature extraction, motion estimation, and image synthesis. Existing approaches mainly differ in how these modules are designed. However, when interpolating high-resolution images, e.g. at 4K, the design choices for achieving high accuracy within reasonable memory requirements are limited. The feature extraction layers help to compress the input and extract relevant information for the later stages, such as motion estimation. However, these layers are often costly in parameters, computation time, and memory. We show how ideas from dimensionality reduction, combined with a lightweight optimization, can be used to compress the input representation while keeping the extracted information suitable for frame interpolation. Further, we require neither a pretrained flow network nor a synthesis network, which additionally reduces the number of trainable parameters and the required memory. When evaluating on three 4K benchmarks, we achieve state-of-the-art image quality among methods that do not use pretrained flow, while having the lowest network complexity and memory requirements overall.
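To illustrate the general idea of replacing learned feature-extraction layers with a dimensionality-reduction step, the following is a minimal sketch in PyTorch: non-overlapping patches of a 4K frame are projected to a low-dimensional representation via PCA instead of a trained encoder. The function name, patch size, and feature dimension are hypothetical choices for this example and are not taken from the paper.

```python
# Minimal sketch (not the paper's actual pipeline): compress image patches
# with a PCA projection instead of learned feature-extraction layers.
import torch
import torch.nn.functional as F


def pca_patch_features(frame, patch=8, k=16):
    """Compress non-overlapping patches of a frame to k dimensions via PCA.

    frame: (1, 3, H, W) tensor in [0, 1]; H and W divisible by `patch`.
    Returns features of shape (1, k, H // patch, W // patch).
    """
    _, c, h, w = frame.shape
    # Flatten each patch into a row vector: (L, C * patch * patch)
    patches = F.unfold(frame, kernel_size=patch, stride=patch)  # (1, C*p*p, L)
    patches = patches.squeeze(0).t()                            # (L, C*p*p)
    # Low-rank PCA gives a data-dependent projection without trainable weights.
    _, _, v = torch.pca_lowrank(patches, q=k)                   # v: (C*p*p, k)
    feats = (patches - patches.mean(0)) @ v                     # (L, k)
    return feats.t().reshape(1, k, h // patch, w // patch)


if __name__ == "__main__":
    frame0 = torch.rand(1, 3, 2160, 3840)  # a 4K frame
    feat0 = pca_patch_features(frame0)
    print(feat0.shape)                     # torch.Size([1, 16, 270, 480])
```

Such a projection has no trainable parameters, which is in the spirit of the abstract's claim of reduced network complexity; the compressed features would then be fed to a downstream motion estimator.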