We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. More specifically, we propose a continuous alignment strategy that iteratively fuses scene frames with the existing geometry to create a seamless mesh. Unlike existing works that focus on generating single objects or zoom-out trajectories from text, our method generates complete 3D scenes with multiple objects and explicit 3D geometry. We evaluate our approach using qualitative and quantitative metrics, demonstrating it as the first method to generate room-scale 3D geometry with compelling textures from only text as input.
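To make the iterative generation loop described above concrete, the following is a minimal Python sketch of the high-level pipeline, under the assumption that the pre-trained inpainting model, the monocular depth estimator, the renderer, and the mesh-fusion step are supplied as callables; all names here are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Scene:
    """Accumulated textured mesh plus the frames fused so far (placeholder container)."""
    mesh: Any = None
    frames: List[Any] = field(default_factory=list)


def generate_scene(
    prompt: str,
    poses: List[Any],                             # tailored viewpoint trajectory
    render: Callable[[Any, Any], Tuple[Any, Any]],   # (mesh, pose) -> (partial_rgb, unobserved_mask)
    inpaint: Callable[[Any, Any, str], Any],         # text-conditioned 2D inpainting model
    estimate_depth: Callable[[Any], Any],            # monocular depth estimation
    fuse: Callable[[Any, Any, Any, Any], Any],       # align and fuse a new frame into the mesh
) -> Scene:
    """Iteratively grow a textured 3D mesh from a text prompt (sketch only).

    Each iteration renders the current mesh from a new viewpoint, fills the
    unobserved regions with a text-conditioned inpainting model, lifts the
    result to 3D via monocular depth, and fuses it with the existing geometry.
    """
    scene = Scene()
    for pose in poses:
        partial_rgb, unobserved_mask = render(scene.mesh, pose)
        rgb = inpaint(partial_rgb, unobserved_mask, prompt)
        depth = estimate_depth(rgb)
        scene.mesh = fuse(scene.mesh, rgb, depth, pose)
        scene.frames.append((pose, rgb, depth))
    return scene
```

The sketch only captures the control flow implied by the abstract (viewpoint selection, inpainting, depth lifting, continuous alignment); the actual alignment and seam-handling details are part of the method proper.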