Language is one of the primary means by which we describe the 3D world around us. While rapid progress has been made in text-to-2D-image synthesis, similar progress in text-to-3D-shape synthesis has been hindered by the lack of paired (text, shape) data. Moreover, extant methods for text-to-shape generation offer limited shape diversity and fidelity. We introduce TextCraft, a method that addresses these limitations by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. TextCraft achieves this by leveraging CLIP and a multi-resolution approach: it first generates in a low-dimensional latent space and then upscales to a higher resolution, improving the fidelity of the generated shapes. To improve shape diversity, we use a discrete latent space, which is modelled using a bidirectional transformer conditioned on the interchangeable image-text embedding space induced by CLIP. Moreover, we present a novel variant of classifier-free guidance that further improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that TextCraft outperforms state-of-the-art baselines.
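The abstract does not spell out the guidance variant, but for context, the standard classifier-free guidance rule applied to a transformer that predicts discrete latent tokens takes the following form. This is a minimal sketch under assumed names (`model`, `clip_emb`, `null_emb`, `scale` are illustrative, not the paper's API), and TextCraft's variant modifies this basic scheme:

```python
# Minimal sketch of standard classifier-free guidance on token logits.
# All names here are illustrative assumptions; TextCraft's variant
# builds on (and differs from) this basic rule.

def guided_logits(model, tokens, clip_emb, null_emb, scale: float):
    """Run the bidirectional transformer twice and extrapolate from the
    unconditional prediction toward the conditional one.

    scale = 0 -> unconditional sampling; scale = 1 -> conditional;
    scale > 1 -> sharper adherence to the text, at some cost in diversity.
    """
    logits_cond = model(tokens, cond=clip_emb)    # conditioned on the CLIP embedding
    logits_uncond = model(tokens, cond=null_emb)  # conditioned on a learned null embedding
    return logits_uncond + scale * (logits_cond - logits_uncond)
```

Sweeping `scale` traces out the accuracy-diversity trade-off that the abstract refers to: higher values push samples toward the text condition, lower values preserve more of the prior's diversity.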