Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency--for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: https://chail.github.io/persistent-nature/.
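As a rough illustration of the scene representation described above, the sketch below alpha-composites samples from a planar layout grid via standard volume rendering and fills the leftover transmittance with a panoramic skydome lookup. The names (`render_ray`, `layout_grid`, `skydome`) and the nearest-neighbor grid lookup are hypothetical stand-ins, not the paper's code; the actual method decodes learned grid features with a 3D decoder.

```python
# Minimal sketch, assuming a layout grid stored as per-cell density/color and an
# equirectangular skydome image. This is an illustrative toy, not the paper's pipeline.
import numpy as np

def render_ray(origin, direction, layout_grid, skydome, grid_extent=32.0,
               near=0.1, far=64.0, n_samples=64):
    """Volume-render one ray against a planar layout grid, then blend in the skydome."""
    direction = direction / np.linalg.norm(direction)
    t_vals = np.linspace(near, far, n_samples)
    pts = origin[None, :] + t_vals[:, None] * direction[None, :]   # (n_samples, 3)

    # Query density and color from the planar grid (toy nearest-neighbor lookup here;
    # the real model would decode learned grid features with a 3D decoder).
    H, W = layout_grid["density"].shape
    ix = np.clip(((pts[:, 0] / grid_extent + 0.5) * W).astype(int), 0, W - 1)
    iz = np.clip(((pts[:, 2] / grid_extent + 0.5) * H).astype(int), 0, H - 1)
    sigma = layout_grid["density"][iz, ix]            # (n_samples,)
    color = layout_grid["color"][iz, ix]              # (n_samples, 3)

    # Standard volume-rendering weights: alpha from density, transmittance by cumprod.
    delta = np.diff(t_vals, append=t_vals[-1] + (t_vals[-1] - t_vals[-2]))
    alpha = 1.0 - np.exp(-sigma * delta)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    rgb = (weights[:, None] * color).sum(axis=0)

    # Remaining transmittance is filled by the skydome, indexed by ray direction.
    theta = np.arctan2(direction[0], direction[2])                  # azimuth
    phi = np.arcsin(np.clip(direction[1], -1.0, 1.0))               # elevation
    sh, sw = skydome.shape[:2]
    u = int((theta / (2 * np.pi) + 0.5) * (sw - 1))
    v = int((0.5 - phi / np.pi) * (sh - 1))
    return rgb + (1.0 - weights.sum()) * skydome[v, u]

# Toy usage: random grid and skydome, one camera ray.
rng = np.random.default_rng(0)
grid = {"density": rng.uniform(0, 0.5, (128, 128)),
        "color": rng.uniform(0, 1, (128, 128, 3))}
sky = rng.uniform(0, 1, (64, 128, 3))
pixel = render_ray(np.array([0.0, 2.0, 0.0]), np.array([0.1, 0.05, 1.0]), grid, sky)
```

Because the layout grid and skydome are fixed world-space structures, re-rendering the same camera pose reproduces the same pixel values, which is the sense in which the representation is persistent and camera-independent.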