In this paper, we investigate the open research task of generating controllable 3D textured shapes from given textual descriptions. Previous works either require ground-truth caption labeling or incur extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, that trains a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. The constructed captions provide high-level semantic supervision for the generated 3D shapes. Furthermore, to produce fine-grained textures and increase geometric diversity, we propose a low-level image regularization that aligns fake rendered images with real ones. During inference, our model generates 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each proposed component and demonstrate the efficacy of our framework in generating high-fidelity, text-relevant 3D textured shapes.
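The pseudo-caption step described above (retrieving vocabulary words relevant to a rendered image, then filling a template) can be illustrated with a minimal sketch. This is a toy illustration, not the paper's implementation: the 2-dimensional vectors, the vocabulary, and the `build_pseudo_caption` helper are all hypothetical stand-ins; a real system would embed the rendered image and each vocabulary word with CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_pseudo_caption(image_emb, vocab_embs, template, top_k=2):
    # rank vocabulary words by similarity to the rendered-image embedding
    ranked = sorted(vocab_embs,
                    key=lambda w: cosine(image_emb, vocab_embs[w]),
                    reverse=True)
    # keep the top-k words and slot them into a caption template
    return template.format(" ".join(ranked[:top_k]))

# toy embeddings (hypothetical; stand-ins for CLIP features)
vocab = {"red": (1.0, 0.1), "chair": (0.9, 0.3), "blue": (-1.0, 0.2)}
caption = build_pseudo_caption((1.0, 0.2), vocab, "a photo of a {}")
print(caption)  # -> a photo of a red chair
```

The resulting captions carry only coarse semantics (object category, color), which is why the paper pairs them with low-level image regularization for fine-grained texture supervision.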