Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
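To make the multi-tile denoising idea concrete, the following is a minimal sketch of the weighted-averaging blend step described above, not the authors' implementation. It assumes each tile's denoised latent lives on a shared 3D voxel/latent grid and that per-tile denoising with the text-to-3D object diffusion model has already been run; function names (`tile_weight`, `blend_tiles`) and the linear-ramp weighting scheme are illustrative assumptions.

```python
import numpy as np


def tile_weight(tile_shape, overlap):
    """Per-voxel blending weight for one tile: a linear ramp over the
    overlap margin on each face, so neighboring tiles cross-fade smoothly.
    (Assumed weighting scheme; the paper only specifies weighted averaging.)"""
    w = np.ones(tile_shape, dtype=np.float32)
    for axis, size in enumerate(tile_shape):
        ramp = np.ones(size, dtype=np.float32)
        if overlap > 0:
            edge = np.linspace(0.0, 1.0, overlap + 2, dtype=np.float32)[1:-1]
            ramp[:overlap] = edge          # fade in at the leading face
            ramp[-overlap:] = edge[::-1]   # fade out at the trailing face
        shape = [1] * len(tile_shape)
        shape[axis] = size
        w *= ramp.reshape(shape)           # separable product over axes
    return w


def blend_tiles(scene_shape, tile_latents, tile_origins, overlap):
    """Weighted-average independently denoised tile latents into one scene grid.

    tile_latents: list of 3D arrays (one denoised latent per tile prompt)
    tile_origins: voxel offsets of each tile within the scene grid
    """
    acc = np.zeros(scene_shape, dtype=np.float32)
    norm = np.zeros(scene_shape, dtype=np.float32)
    for latent, origin in zip(tile_latents, tile_origins):
        w = tile_weight(latent.shape, overlap)
        region = tuple(slice(o, o + s) for o, s in zip(origin, latent.shape))
        acc[region] += w * latent
        norm[region] += w
    # Normalize so overlap regions are true weighted averages and
    # non-overlapping regions keep their single tile's values.
    return acc / np.maximum(norm, 1e-8)


if __name__ == "__main__":
    # Toy example: two 8x8x8 tiles overlapping by 2 voxels along x.
    rng = np.random.default_rng(0)
    tiles = [rng.normal(size=(8, 8, 8)).astype(np.float32) for _ in range(2)]
    scene = blend_tiles((14, 8, 8), tiles, [(0, 0, 0), (6, 0, 0)], overlap=2)
    print(scene.shape)  # (14, 8, 8), seamless across the 2-voxel overlap
```

In a full pipeline this blend would typically be applied inside the denoising loop (each tile is denoised one step under its own prompt, then overlapping latents are averaged before the next step), which is what keeps adjacent tiles geometrically and semantically consistent without any scene-level training.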