Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io
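To make the core mechanism concrete, the sketch below illustrates the idea of optimizing a single new word embedding against an otherwise frozen text encoder. It is a minimal toy example under stated assumptions: the transformer encoder, the prompt token ids, and the reconstruction loss are placeholders invented for illustration, not the paper's actual latent-diffusion architecture or denoising objective, which conditions a frozen diffusion model on the text encoding.

```python
# Minimal sketch of the core idea: learn one new embedding vector ("word")
# in a frozen text encoder's embedding table so that it captures a
# user-provided concept. The encoder, denoiser stand-in, and loss here are
# toy placeholders (assumptions), not the paper's actual architecture.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
embedding = nn.Embedding(vocab_size + 1, embed_dim)   # extra slot for the new pseudo-word S*
new_token_id = vocab_size                              # index of S*

# Freeze the existing vocabulary; only the single new vector is trainable.
embedding.weight.requires_grad_(False)
s_star = nn.Parameter(embedding.weight[new_token_id].clone())

# Hypothetical frozen text encoder (the real method uses the encoder of a
# pre-trained text-to-image model, kept frozen).
frozen_text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2
).eval()
for p in frozen_text_encoder.parameters():
    p.requires_grad_(False)

def encode_prompt(token_ids: torch.Tensor) -> torch.Tensor:
    # Look up embeddings, splicing in the trainable vector wherever S* appears.
    embs = embedding(token_ids)
    embs = torch.where((token_ids == new_token_id).unsqueeze(-1), s_star, embs)
    return frozen_text_encoder(embs)

def reconstruction_loss(cond: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the frozen generator's conditioned loss;
    # the actual objective is the diffusion model's denoising loss.
    return ((cond.mean(dim=1) - images.flatten(1)[:, :embed_dim]) ** 2).mean()

optimizer = torch.optim.Adam([s_star], lr=5e-3)        # only the new "word" is optimized
prompt = torch.tensor([[1, 2, new_token_id, 3]])       # e.g. "a photo of S*"
concept_images = torch.randn(2, 3, 8, 8)               # stand-in for the 3-5 user images

for step in range(100):
    loss = reconstruction_loss(encode_prompt(prompt.expand(2, -1)), concept_images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After optimization, the learned vector can be substituted for S* inside arbitrary prompts, which is what allows the concept to be composed into new sentences and scenes.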