Synthesizing 3D scenes from open-vocabulary text descriptions is a challenging, important, and recently popular application. One of its critical subproblems is layout generation: given a set of objects, arrange them to produce a scene matching the input description. Nearly all recent work adopts a declarative paradigm for this problem, in which an LLM generates a specification of constraints between objects and a solver then satisfies those constraints to produce the final layout. In contrast, we explore an alternative imperative paradigm, in which an LLM places objects one at a time, with each object's position and orientation computed as a function of previously placed objects. The imperative approach allows for a simpler scene specification language while also handling a wider variety of scenes and scenes of greater complexity. We further improve the robustness of our imperative scheme by developing an error correction mechanism that iteratively improves the scene's validity while staying as close as possible to the original layout generated by the LLM. In forced-choice perceptual studies, participants preferred layouts generated by our imperative approach 82% and 94% of the time when compared against two declarative layout generation methods. We also present a simple, automated evaluation metric for 3D scene layout generation that aligns well with human preferences.
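To make the imperative idea concrete, the following is a minimal 2D sketch, not the system described above. It assumes axis-aligned boxes on a floor plane; the names `Box`, `right_of`, `overlap_x`, and `correct` are hypothetical and chosen only for illustration. Each object is placed by a direct computation on an already-placed object, and a simple repair pass then shifts overlapping objects by the smallest distance along x that separates them, standing in for an error correction step that stays close to the original layout.

```python
from __future__ import annotations
from dataclasses import dataclass

EPS = 1e-9  # tolerance so that touching boxes do not count as overlapping


@dataclass
class Box:
    name: str
    x: float  # center along x
    y: float  # center along y
    w: float  # extent along x
    d: float  # extent along y


def right_of(anchor: Box, name: str, w: float, d: float, gap: float = 0.1) -> Box:
    """Imperative placement: compute the new box directly from a placed anchor."""
    return Box(name, anchor.x + anchor.w / 2 + gap + w / 2, anchor.y, w, d)


def overlap_x(a: Box, b: Box) -> float:
    """Penetration depth along x if the two boxes overlap, otherwise 0."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox if ox > EPS and oy > EPS else 0.0


def correct(scene: list[Box], max_iters: int = 20) -> list[Box]:
    """Hypothetical repair pass: move later boxes just far enough to clear overlaps."""
    for _ in range(max_iters):
        moved = False
        for i, a in enumerate(scene):
            for b in scene[i + 1:]:
                pen = overlap_x(a, b)
                if pen > 0:
                    # Shift b away from a by exactly the penetration depth.
                    b.x += pen if b.x >= a.x else -pen
                    moved = True
        if not moved:
            break
    return scene


# Objects are placed one at a time, each relative to earlier ones.
bed = Box("bed", 0.0, 0.0, 2.0, 1.6)
nightstand = right_of(bed, "nightstand", 0.5, 0.5)
# Deliberately faulty placement that overlaps the nightstand.
chair = Box("chair", nightstand.x + 0.3, nightstand.y, 0.5, 0.5)

scene = correct([bed, nightstand, chair])
```

In this toy version, a declarative system would instead emit a constraint such as "nightstand is right of bed" and hand it to a solver; here the placement is computed immediately, and only invalid results are repaired afterwards.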