We propose VINO, the first zero-shot, training-free video editing method conditioned on both image and text. Our approach introduces $\rho$-start sampling and dilated dual masking to construct structured noise maps that enable coherent and accurate edits. To further enhance visual fidelity, we present zero image guidance, a controllable negative prompt strategy. Extensive experiments demonstrate that VINO faithfully incorporates the reference image into video edits and performs strongly against state-of-the-art baselines, without any test-time or instance-specific training.