Recent research has shown that combining neural radiance fields (NeRFs) with pre-trained diffusion models holds great potential for text-to-3D generation. However, these methods often suffer from guidance collapse when rendering complex scenes from multi-object texts, because text-to-image diffusion models are inherently unconstrained and thus struggle to accurately associate object semantics with specific 3D structures. To address this issue, we propose a novel framework, dubbed CompoNeRF, that explicitly incorporates an editable 3D scene layout to provide effective guidance at both the single-object (i.e., local) and whole-scene (i.e., global) levels. First, we interpret the multi-object text as an editable 3D scene layout containing multiple local NeRFs, each associated with object-specific 3D box coordinates and a text prompt, which can be easily collected from users. Then, we introduce a global MLP to calibrate the compositional latent features from the local NeRFs, which surprisingly improves view consistency across different local NeRFs. Lastly, we apply text guidance at both the global and local levels through their corresponding views to avoid guidance ambiguity. In this way, CompoNeRF allows flexible scene editing and re-composition of trained local NeRFs into a new scene by manipulating the 3D layout or the text prompts. Leveraging the open-source Stable Diffusion model, CompoNeRF can generate faithful and editable text-to-3D results while opening a potential direction for text-guided multi-object composition via the editable 3D scene layout.
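To make the layout-conditioned composition concrete, the following is a minimal sketch, not the authors' implementation: each text-described object owns a local NeRF confined to its 3D box, and a shared global MLP calibrates the density-weighted composite of their latent features before decoding to color, mirroring the local/global split described above. All names (LocalNeRF, GlobalCalibrator, compose_scene), layer sizes, and the blending rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalNeRF(nn.Module):
    """Tiny MLP: 3D point (in the box's local frame) -> latent feature + density."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim + 1))

    def forward(self, x_local):
        out = self.net(x_local)
        feat, sigma = out[..., :-1], torch.relu(out[..., -1:])
        return feat, sigma

class GlobalCalibrator(nn.Module):
    """Global MLP that maps the composited latent feature to RGB,
    standing in for the cross-object calibration step."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3), nn.Sigmoid())

    def forward(self, feat):
        return self.net(feat)

def compose_scene(points, boxes, local_nerfs, calibrator):
    """Query each local NeRF only inside its axis-aligned box and
    density-weight the latent features before global calibration.

    points: (N, 3) world-space sample points
    boxes:  list of (center, half_size) tensors, one per object
    """
    feat_dim = calibrator.net[0].in_features
    feat_sum = torch.zeros(points.shape[0], feat_dim)
    sigma_sum = torch.zeros(points.shape[0], 1)
    for (center, half_size), nerf in zip(boxes, local_nerfs):
        x_local = (points - center) / half_size             # normalize to the box frame
        inside = (x_local.abs() <= 1.0).all(dim=-1, keepdim=True).float()
        feat, sigma = nerf(x_local)
        sigma = sigma * inside                              # zero density outside the box
        feat_sum += feat * sigma                             # density-weighted feature blend
        sigma_sum += sigma
    feat = feat_sum / (sigma_sum + 1e-8)
    rgb = calibrator(feat)                                   # globally calibrated color
    return rgb, sigma_sum

if __name__ == "__main__":
    # Two hypothetical object boxes from a user-provided layout.
    boxes = [(torch.tensor([-0.5, 0.0, 0.0]), torch.tensor([0.4, 0.4, 0.4])),
             (torch.tensor([0.5, 0.0, 0.0]), torch.tensor([0.4, 0.4, 0.4]))]
    nerfs = [LocalNeRF() for _ in boxes]
    rgb, sigma = compose_scene(torch.rand(1024, 3) * 2 - 1, boxes, nerfs, GlobalCalibrator())
    print(rgb.shape, sigma.shape)  # torch.Size([1024, 3]) torch.Size([1024, 1])
```

In the actual pipeline, the globally composited renderings and the per-box renderings would each receive diffusion-based text guidance (global and local prompts, respectively), and editing or re-composition amounts to changing the box list or swapping trained local NeRFs.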