Effectively extracting inter-frame motion and appearance information is important for video frame interpolation (VFI). Previous works either extract both types of information in a mixed way or design separate modules for each, which leads to representation ambiguity and low efficiency. In this paper, we propose a novel module that explicitly extracts motion and appearance information via a unified operation. Specifically, we rethink the information processing in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. Furthermore, for efficient VFI, the proposed module can be seamlessly integrated into a hybrid CNN and Transformer architecture. This hybrid pipeline alleviates the computational complexity of inter-frame attention while preserving detailed low-level structure information. Experimental results demonstrate that, for both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets. Meanwhile, our approach incurs a lighter computational overhead than models with comparable performance. The source code and models are available at https://github.com/MCG-NJU/EMA-VFI.
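To make the attention-map reuse concrete, below is a minimal PyTorch sketch, not the released implementation (see the repository above for that). The module name `InterFrameAttention`, the single-head layout, the token shapes, and the coordinate-expectation motion estimate are all illustrative assumptions; the key idea shown is that one attention map serves two purposes: aggregating appearance features and estimating soft correspondences for motion.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Hypothetical single-head sketch of reusing one inter-frame
    attention map for appearance enhancement and motion extraction."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, feat0, feat1, coords):
        # feat0, feat1: (B, N, C) tokens from the two input frames
        # coords:       (N, 2) normalized (x, y) position of each token
        q = self.q(feat0)                    # queries from frame 0
        k = self.k(feat1)                    # keys from frame 1
        v = self.v(feat1)                    # values from frame 1
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)          # (B, N, N) attention map

        # 1) Appearance: enhance frame-0 features with frame-1 content
        #    aggregated by the attention map.
        appearance = feat0 + attn @ v        # (B, N, C)

        # 2) Motion: reuse the SAME attention map as soft correspondences;
        #    the attention-weighted expected position in frame 1 minus the
        #    query position gives a coarse per-token displacement.
        matched_pos = attn @ coords          # (B, N, 2), broadcast over batch
        motion = matched_pos - coords        # (B, N, 2) motion vectors
        return appearance, motion
```

A quick usage check under these assumptions: for an 8x8 token grid, build `coords` with `torch.meshgrid` over `torch.linspace(0, 1, 8)`, stack into an (64, 2) tensor, and pass it with two (B, 64, C) feature maps; the module returns enhanced appearance features and coarse motion vectors from a single attention computation, which is the efficiency argument the abstract makes.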