We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images, while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images. On the Matterport3D and RealEstate10K datasets, our approach significantly outperforms prior work when evaluated by humans, as well as on FID scores. Further, we show that our model is useful for generative data augmentation. A vision-and-language navigation (VLN) agent trained with trajectories spatially perturbed by our model improves success rate by up to 1.5% over a state-of-the-art baseline on the R2R benchmark. Our code will be made available to facilitate generative data augmentation and applications to downstream robotics and embodied AI tasks.