Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts. However, current personalization approaches struggle with lengthy training times, high storage requirements, or loss of identity. To overcome these limitations, we propose an encoder-based domain-tuning approach. Our key insight is that by underfitting on a large set of concepts from a given domain, we can improve generalization and create a model that is more amenable to quickly adding novel concepts from the same domain. Specifically, we employ two components: First, an encoder that takes as input a single image of a target concept from a given domain, e.g. a specific face, and learns to map it to a word embedding representing the concept. Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts. Together, these components are used to guide the learning of unseen concepts, allowing us to personalize a model using only a single image and as few as 5 training steps, accelerating personalization from dozens of minutes to seconds while preserving quality.
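To make the two components concrete, here is a minimal PyTorch sketch of an image-to-word-embedding encoder and a regularized weight-offset wrapper for the frozen text-to-image model's linear layers. All module names, dimensions, and the specific L2 regularizer are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch, assuming a frozen image backbone (e.g. CLIP-style)
# and a Stable-Diffusion-like text-to-image model. Names and shapes
# here are hypothetical, chosen only to illustrate the two components.
import torch
import torch.nn as nn


class ConceptEncoder(nn.Module):
    """Maps a single concept image to a word embedding in the
    text encoder's embedding space (dimension `embed_dim`)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int = 768):
        super().__init__()
        self.backbone = backbone          # frozen image feature extractor
        self.to_embedding = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.LayerNorm(embed_dim),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(image)      # (B, feat_dim)
        return self.to_embedding(feats)   # (B, embed_dim) word embedding


class OffsetLinear(nn.Module):
    """Wraps a frozen linear layer of the diffusion model with a
    learned weight offset: W_eff = W + scale * dW."""

    def __init__(self, base: nn.Linear, scale: float = 1.0):
        super().__init__()
        self.base = base                  # frozen pre-trained layer
        self.scale = scale
        self.delta = nn.Parameter(torch.zeros_like(base.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.base.weight + self.scale * self.delta
        return nn.functional.linear(x, weight, self.base.bias)

    def reg_loss(self) -> torch.Tensor:
        # Penalizing ||dW||^2 keeps the tuned model close to the prior,
        # consistent with the deliberate underfitting described above.
        return self.delta.pow(2).mean()
```

At inference, the encoder's predicted embedding stands in for the concept's placeholder token in the prompt, and only the small offsets (plus the embedding) need to be tuned on the new concept, which is what allows adaptation in a handful of steps.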