Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time-optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that if a video inpainting model can be finetuned to remove foreground-associated effects, it must inherently be capable of perceiving those effects, and can therefore also be finetuned for the complementary task: decomposing the foreground layer together with its associated effects. However, naïvely finetuning the inpainting model with LoRA applied to all blocks produces high-quality alpha mattes but fails to capture the associated effects. Our systematic analysis reveals that this is because effect-related cues are encoded primarily in specific DiT blocks and are suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn two complementary experts while keeping its original weights intact: an Effect Expert, in which LoRA is applied only to the effect-sensitive DiT blocks to capture the coarse structure of the foreground and its associated effects, and a Quality Expert, in which LoRA is applied to all blocks to refine the alpha matte. During sampling, the Effect Expert denoises the early, high-noise steps, while the Quality Expert takes over at the later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state of the art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
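The Dual-Expert sampling schedule can be sketched as follows. This is a minimal illustration, not the actual implementation: all identifiers (`NUM_STEPS`, `SWITCH_STEP`, the two `apply_*` stand-in functions) are hypothetical, and the stand-ins replace what is really a shared DiT backbone with two interchangeable LoRA adapters.

```python
# Minimal sketch of the Dual-Expert denoising schedule (assumed names throughout).
# In the real method, both experts share one pretrained inpainting DiT; switching
# experts means swapping which LoRA adapter is active, so only a single diffusion
# pass over the timesteps is needed.

NUM_STEPS = 50    # total denoising steps (assumed value)
SWITCH_STEP = 30  # step at which the Quality Expert takes over (assumed value)

def apply_effect_expert(latent, step):
    """Stand-in for the base model with LoRA on effect-sensitive blocks only."""
    return f"effect({latent}@{step})"

def apply_quality_expert(latent, step):
    """Stand-in for the base model with LoRA applied to all blocks."""
    return f"quality({latent}@{step})"

def dual_expert_sample(latent):
    schedule = []
    for step in range(NUM_STEPS):
        if step < SWITCH_STEP:
            # Early, high-noise steps: the Effect Expert lays down the coarse
            # foreground structure and its associated effects (shadows, etc.).
            latent = apply_effect_expert(latent, step)
            schedule.append("effect")
        else:
            # Later, low-noise steps: the Quality Expert refines the alpha matte.
            latent = apply_quality_expert(latent, step)
            schedule.append("quality")
    return latent, schedule

out, schedule = dual_expert_sample("z_T")
print(schedule.count("effect"), schedule.count("quality"))  # → 30 20
```

Because the switch is just a change of active adapter partway through one denoising trajectory, the cost stays that of a single sampling pass rather than two.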