Generating shapes using natural language can enable new ways of imagining and creating the things around us. While significant recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of paired text and shape data at a large scale. We present a simple yet effective method for zero-shot text-to-shape generation that circumvents such data scarcity. Our proposed method, named CLIP-Forge, is based on a two-stage training process, which depends only on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method avoids expensive inference-time optimization and can generate multiple shapes for a given text. We not only demonstrate promising zero-shot generalization of the CLIP-Forge model qualitatively and quantitatively, but also provide extensive comparative evaluations to better understand its behavior.
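To make the abstract's description concrete, the sketch below illustrates how a CLIP-conditioned, two-stage generator of this kind might be invoked at inference time: a text prompt is embedded with a pre-trained CLIP model, a conditional prior samples several shape latents from that embedding, and a shape decoder (trained in the first stage) would map each latent to a 3D shape. This is a minimal illustration only; `ConditionalPrior` and the commented-out `shape_decoder` are hypothetical placeholders rather than the paper's actual networks, and only the CLIP calls follow the public openai/CLIP API.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP


class ConditionalPrior(torch.nn.Module):
    """Toy stand-in for a learned conditional prior over shape latents.

    Hypothetical placeholder: maps a CLIP embedding to the parameters of a
    Gaussian over shape latents and draws samples from it.
    """

    def __init__(self, clip_dim=512, latent_dim=256):
        super().__init__()
        self.mu = torch.nn.Linear(clip_dim, latent_dim)
        self.log_sigma = torch.nn.Linear(clip_dim, latent_dim)

    def sample(self, cond, n_samples=1):
        mu, sigma = self.mu(cond), self.log_sigma(cond).exp()
        eps = torch.randn(n_samples, mu.shape[-1], device=mu.device)
        return mu + sigma * eps  # n_samples candidate latents for one condition


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

prior = ConditionalPrior().to(device)   # would be trained in the second stage
# shape_decoder = ...                   # would be trained in the first stage

text = clip.tokenize(["a round chair"]).to(device)
with torch.no_grad():
    cond = clip_model.encode_text(text).float()           # (1, 512) text embedding
    latents = prior.sample(cond.squeeze(0), n_samples=4)  # 4 candidate shape latents
    # shapes = [shape_decoder(z) for z in latents]        # decode each latent to a shape
```

Because sampling from the conditional prior is stochastic, several distinct latents (and hence several shapes) can be produced for the same text without any per-prompt optimization, which is the zero-shot, multi-sample behavior the abstract highlights.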