We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
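The core of Temporal RoPE Interpolation is mapping each conditioning pixel frame to a continuous, fractional position in the latent sequence, so that frames compressed into the same latent by the causal VAE no longer collide at one integer index. The following is a minimal sketch of that idea, assuming a stride-4 causal VAE and standard RoPE; the function names, the stride value, and the rotation dimensionality are illustrative, not the paper's implementation.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    # Standard RoPE: rotation angles for a (possibly fractional) position.
    # `dim` is the head dimension; angles come in pairs, hence dim // 2 freqs.
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return pos * freqs

def fractional_latent_position(pixel_frame, temporal_stride=4):
    # Map a pixel-frame timestamp to a continuous position in the latent
    # sequence. A causal VAE that packs `temporal_stride` pixel frames into
    # one latent would otherwise give frames 4, 5, 6, 7 the same integer
    # latent index; the fractional position keeps them distinct.
    return pixel_frame / temporal_stride

# Example: condition frames 0, 2, and 5 under an assumed stride-4 VAE.
for f in [0, 2, 5]:
    p = fractional_latent_position(f)
    print(f"pixel frame {f} -> latent position {p}, angles {rope_angles(p, dim=8)}")
```

Because RoPE rotations are a smooth function of position, evaluating them at non-integer positions requires no new parameters, which is what lets the scheme run on a frozen backbone.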