Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained to minimize the L1 or L2 distance between their outputs and the ground-truth frames. Despite recent advances, existing VFI methods tend to produce perceptually inferior results, particularly for challenging scenarios involving large motions and dynamic textures. Towards developing perceptually oriented VFI methods, we propose LDMVFI, a latent diffusion model-based VFI method. LDMVFI approaches the VFI problem from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method following the common evaluation protocol adopted in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI interpolates video content with superior perceptual quality compared to the state of the art, even in the high-resolution regime. Our source code will be made available here.
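To make the conditional-generation formulation concrete, below is a minimal illustrative sketch, not the paper's actual architecture: a denoising network is trained to predict the noise added to the latent of the intermediate frame, conditioned on the latents of the two neighboring frames. All names here (`TinyDenoiser`, `diffusion_training_step`, the linear denoiser, the crude timestep embedding) are hypothetical stand-ins introduced only for illustration.

```python
# Illustrative sketch of VFI as conditional latent diffusion
# (hypothetical; not the LDMVFI architecture). A denoiser learns to
# predict the noise injected into the middle frame's latent z1,
# conditioned on the latents z0, z2 of the two neighboring frames.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for the conditional denoising network."""
    def __init__(self, latent_dim=64):
        super().__init__()
        # Input: noisy middle latent + two conditioning latents + timestep.
        self.net = nn.Sequential(
            nn.Linear(3 * latent_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_noisy, z0, z2, t):
        t_emb = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([z_noisy, z0, z2, t_emb], dim=-1))

def diffusion_training_step(denoiser, z0, z1, z2, alphas_cumprod):
    """One DDPM-style training step: noise the target latent z1 and
    regress the injected noise, conditioned on the neighbor latents."""
    t = torch.randint(0, len(alphas_cumprod), (z1.shape[0],))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(z1)
    z_noisy = a_bar.sqrt() * z1 + (1 - a_bar).sqrt() * eps
    eps_pred = denoiser(z_noisy, z0, z2, t)
    return nn.functional.mse_loss(eps_pred, eps)
```

Note how this differs from direct L1/L2 regression on pixels: the loss is a denoising objective in latent space, so at inference the interpolated frame is sampled by iteratively denoising from noise given the two input frames, rather than predicted in a single deterministic pass.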