Generating photorealistic images with controllable camera pose and scene contents is essential for many applications, including AR/VR and simulation. Despite rapid progress in 3D-aware generative models, most existing methods focus on object-centric images and are not applicable to generating urban scenes, where free camera viewpoint control and scene editing are required. To address this challenging task, we propose UrbanGIRAFFE, which uses a coarse 3D panoptic prior, including the layout distribution of uncountable stuff and countable objects, to guide a 3D-aware generative model. Our model is compositional and controllable as it breaks down the scene into stuff, objects, and sky. Using the stuff prior in the form of semantic voxel grids, we build a conditioned stuff generator that effectively incorporates coarse semantic and geometric information. The object layout prior further allows us to learn an object generator from cluttered scenes. With appropriate loss functions, our approach facilitates photorealistic 3D-aware image synthesis with diverse controllability, including large camera movement, stuff editing, and object manipulation. We validate the effectiveness of our model on both synthetic and real-world datasets, including the challenging KITTI-360 dataset.
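To make the compositional design concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how a pixel could be rendered from three branches: a stuff generator conditioned on a semantic voxel grid, an object generator restricted to a layout box, and a sky color that fills the remaining transmittance. All module names, shapes, and hyperparameters below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondMLP(nn.Module):
    """Tiny MLP mapping a conditioned point feature to (rgb, density)."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 channels for RGB, 1 for density
        )
    def forward(self, x):
        out = self.net(x)
        return torch.sigmoid(out[..., :3]), F.softplus(out[..., 3:])

def sample_semantic_feature(voxel_grid, pts):
    """Trilinearly sample a (1, C, D, H, W) semantic voxel grid at points in [-1, 1]^3."""
    n = pts.shape[0]
    grid = pts.view(1, n, 1, 1, 3)                  # grid_sample expects (N, D, H, W, 3)
    feat = F.grid_sample(voxel_grid, grid, align_corners=True)
    return feat.view(voxel_grid.shape[1], n).t()    # (n, C)

def composite(rgb, sigma, deltas):
    """Standard volume-rendering compositing along a single ray."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights.unsqueeze(-1) * rgb).sum(dim=0), weights

# Toy setup: coarse semantic voxels as the stuff prior, one object latent, one ray.
sem_dim, z_dim, n_pts = 8, 16, 64
voxel_grid = torch.randn(1, sem_dim, 32, 32, 32)     # hypothetical stuff prior
z_stuff, z_obj = torch.randn(z_dim), torch.randn(z_dim)
stuff_gen = CondMLP(3 + sem_dim + z_dim)             # point + semantic feature + latent
obj_gen = CondMLP(3 + z_dim)                         # point (object-local coords) + latent

pts = torch.rand(n_pts, 3) * 2 - 1                   # samples along the ray, in [-1, 1]^3
deltas = torch.full((n_pts,), 2.0 / n_pts)

# Stuff branch: condition each point on the interpolated semantic feature.
sem_feat = sample_semantic_feature(voxel_grid, pts)
rgb_s, sig_s = stuff_gen(torch.cat([pts, sem_feat, z_stuff.expand(n_pts, -1)], dim=-1))

# Object branch: only points inside the (hypothetical) layout box contribute density.
in_box = ((pts > -0.2) & (pts < 0.2)).all(dim=-1, keepdim=True).float()
rgb_o, sig_o = obj_gen(torch.cat([pts, z_obj.expand(n_pts, -1)], dim=-1))
sig_o = sig_o * in_box

# Compose stuff and object radiance, alpha-composite, and let a sky color
# account for the transmittance that remains after the ray exits the scene.
rgb = (rgb_s * sig_s + rgb_o * sig_o) / (sig_s + sig_o + 1e-10)
color, weights = composite(rgb, sig_s + sig_o, deltas)
sky_rgb = torch.tensor([0.6, 0.7, 0.9])
pixel = color + (1.0 - weights.sum()) * sky_rgb
print(pixel)
```

Editing the semantic voxel grid, the object layout boxes, or the per-branch latents in such a setup would correspond to the stuff editing, object manipulation, and appearance control described above; the actual UrbanGIRAFFE generators and losses are more elaborate than this sketch.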