Generating long-range, geometrically consistent video presents a fundamental dilemma: consistency demands strict adherence to 3D geometry in pixel space, yet state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via 3D Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring that each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this with a Spatio-Temporal Diffusion (ST-Diff) model designed for a ``fill-and-revise'' objective. Our key innovation is a spatio-temporally varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity, with 3D geometry guiding structure while diffusion perfects texture. Project page: \href{https://hyokong.github.io/worldwarp-page/}{https://hyokong.github.io/worldwarp-page/}.
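To make the varying noise schedule concrete, the following is a minimal PyTorch-style sketch of the forward (noising) step it implies, written from the description above rather than from the released code; the function and argument names (\texttt{spatiotemporal\_noise}, \texttt{warp\_mask}, \texttt{t\_full}, \texttt{t\_partial}) are illustrative assumptions, not the paper's actual API.

\begin{verbatim}
import torch

def spatiotemporal_noise(frames, warp_mask, alpha_bar, t_full, t_partial):
    """Sketch of a spatially varying forward-diffusion step.

    frames:    (B, T, C, H, W) frames warped from the 3D cache;
               occluded holes may contain arbitrary values.
    warp_mask: (B, T, 1, H, W), 1 where warped content exists, 0 in holes.
    alpha_bar: (num_steps,) cumulative product of (1 - beta_t).
    t_full:    timestep for holes (near-max noise -> pure generation).
    t_partial: timestep for warped pixels (partial noise -> refinement).
    """
    eps = torch.randn_like(frames)
    # Per-pixel timestep map: holes get the full-noise step, warped
    # pixels a smaller step so the 3D-anchored structure survives.
    t_map = torch.where(
        warp_mask.bool(),
        torch.full_like(warp_mask, float(t_partial)),
        torch.full_like(warp_mask, float(t_full)),
    ).long()
    a = alpha_bar[t_map]  # per-pixel alpha_bar, broadcast over channels
    noisy = a.sqrt() * frames + (1.0 - a).sqrt() * eps
    return noisy, t_map

# Hypothetical usage with a standard linear beta schedule:
# betas = torch.linspace(1e-4, 0.02, 1000)
# alpha_bar = torch.cumprod(1.0 - betas, dim=0)
# noisy, t_map = spatiotemporal_noise(frames, mask, alpha_bar, 999, 400)
\end{verbatim}

Under this sketch, the denoiser would be conditioned on \texttt{t\_map} (or on the mask) so that it regenerates holes from scratch while only revising warped content, matching the ``fill-and-revise'' objective.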