In professional video compositing workflows, artists must manually create environmental interactions, such as shadows, reflections, dust, and splashes, between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored to this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.