Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they achieve only coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's "Yes"/"No" log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
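A minimal sketch of the dual-query log-odds critic described above, assuming the VLM is prompted with a semantic-fidelity question and a geometric-coherence question and that we read off its next-token logits for "Yes" and "No". All names, token ids, and the mixing weight are illustrative assumptions, not the paper's actual implementation:

```python
import math

def vlm_log_odds(vocab_logits, yes_id, no_id):
    """Critic signal: log p(Yes) - log p(No) under the VLM's next-token
    softmax. The log-normalizer cancels in the difference, so this reduces
    to a plain logit difference (token ids are hypothetical)."""
    z = max(vocab_logits)
    log_norm = z + math.log(sum(math.exp(l - z) for l in vocab_logits))
    log_p_yes = vocab_logits[yes_id] - log_norm
    log_p_no = vocab_logits[no_id] - log_norm
    return log_p_yes - log_p_no

def dual_query_reward(sem_logits, spa_logits, yes_id, no_id, w_sem=0.5):
    """Combine the semantic query ("Does the render match the prompt?")
    and the spatial query ("Is the geometry coherent?") into one scalar.
    The weight w_sem is an assumed hyperparameter."""
    return (w_sem * vlm_log_odds(sem_logits, yes_id, no_id)
            + (1.0 - w_sem) * vlm_log_odds(spa_logits, yes_id, no_id))

# Toy vocabulary of 3 tokens: index 0 = "Yes", index 1 = "No".
r = dual_query_reward([2.0, 0.0, -1.0], [0.5, 1.5, 0.0], yes_id=0, no_id=1)
```

In a differentiable pipeline the same quantity would be computed on framework tensors so gradients flow back through the render into the 3D representation; the scalar form here just makes the logit-difference identity explicit.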