We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision, our method deforms the control shape of a limit subdivided surface, along with its texture map and normal map, to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed in games or modeling applications. We rely only on a pre-trained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models, we perform optimization directly on mesh parameters to generate shape, texture, or both. To constrain the optimization to produce plausible meshes and textures, we introduce a number of techniques, including image augmentations and the use of a pretrained prior that generates CLIP image embeddings given a text embedding.
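The abstract compresses the whole pipeline into a few sentences; a minimal sketch may help make the core optimization loop concrete. This is not the paper's implementation: `render_fn` (a differentiable renderer mapping the control-shape and texture parameters to an image) and `augment_fn` (the image augmentations mentioned above) are abstracted as hypothetical callables, and the parameter names and shapes are illustrative only.

```python
# Sketch of a CLIP-guided mesh optimization step, assuming a differentiable
# renderer `render_fn` and an augmentation pipeline `augment_fn` are provided.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so gradients are simple
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only mesh parameters update

# Encode the target text prompt once.
text_tokens = clip.tokenize(["a matte blue ceramic vase"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Learnable mesh parameters (illustrative shapes: control vertices of the
# subdivision surface and an RGB texture map).
vertices = torch.randn(642, 3, device=device, requires_grad=True)
texture = torch.rand(1, 3, 512, 512, device=device, requires_grad=True)
optimizer = torch.optim.Adam([vertices, texture], lr=1e-2)

def optimization_step(render_fn, augment_fn):
    """One step: render, augment, embed with CLIP, maximize text similarity."""
    optimizer.zero_grad()
    image = render_fn(vertices, texture)        # (1, 3, H, W), values in [0, 1]
    batch = augment_fn(image)                   # random crops etc. -> (N, 3, 224, 224)
    img_emb = model.encode_image(batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_emb @ text_emb.T).mean()  # negative CLIP cosine similarity
    loss.backward()                             # gradients flow through the renderer
    optimizer.step()
    return loss.item()
```

Keeping CLIP frozen and optimizing only the mesh parameters is what makes the method zero-shot: the sole supervision signal is the CLIP similarity between rendered views and the prompt, and the augmentations regularize the optimization against adversarial renders that score well from a single view.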