We present DiffIR2VR-Zero, a zero-shot framework that enables any pre-trained image restoration diffusion model to perform high-quality video restoration without additional training. While image diffusion models have shown remarkable restoration capabilities, applying them directly to video produces temporally inconsistent results, and existing video restoration methods require extensive retraining for each degradation type. Our approach addresses these challenges through two key innovations: a hierarchical latent warping strategy that maintains consistency across both keyframes and local frames, and a hybrid token merging mechanism that adaptively combines optical flow with feature matching. Extensive experiments demonstrate that our method not only preserves the high-quality restoration of the base diffusion model but also achieves superior temporal consistency across diverse datasets and degradation conditions, including challenging scenarios such as 8$\times$ super-resolution and severe noise. Importantly, our framework works with any image restoration diffusion model, providing a versatile solution for video enhancement without task-specific training or modifications. Project page: https://jimmycv07.github.io/DiffIR2VR_web/
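To make the hybrid token merging idea concrete, the following is a minimal PyTorch sketch of one plausible reading of such a mechanism: per-token correspondences from optical flow are trusted where the flow is confident, with a fallback to appearance-based (cosine-similarity) feature matching elsewhere. The function name, the confidence threshold `tau`, and the exact fallback rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_token_matching(src_tokens, tgt_tokens, flow_matches,
                          flow_confidence, tau=0.8):
    """Illustrative sketch of hybrid flow/feature token matching.

    src_tokens:      (N, C) tokens of the current frame
    tgt_tokens:      (M, C) tokens of the reference/key frame
    flow_matches:    (N,)   index into tgt_tokens proposed by optical flow
    flow_confidence: (N,)   per-token flow confidence in [0, 1] (assumed given)
    tau:             confidence threshold below which we fall back
                     to feature matching (hypothetical value)
    """
    # Feature matching: nearest neighbor under cosine similarity.
    src_n = F.normalize(src_tokens, dim=-1)
    tgt_n = F.normalize(tgt_tokens, dim=-1)
    sim = src_n @ tgt_n.T                      # (N, M) similarity matrix
    feat_matches = sim.argmax(dim=-1)          # (N,) best feature match

    # Hybrid rule: trust flow where it is confident, otherwise
    # fall back to appearance-based matching.
    use_flow = flow_confidence >= tau
    return torch.where(use_flow, flow_matches, feat_matches)

# Toy usage with random tensors (shapes only; not real video frames).
N, M, C = 128, 128, 64
src = torch.randn(N, C)
tgt = torch.randn(M, C)
flow_idx = torch.randint(0, M, (N,))
flow_conf = torch.rand(N)
matches = hybrid_token_matching(src, tgt, flow_idx, flow_conf)
```

The design intuition this sketch captures is that flow-based correspondences are geometrically precise but fail under occlusion or large motion, while feature matching is more robust there; a per-token confidence gate lets the two complement each other.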