Recent advances in personalized image generation allow a pre-trained text-to-image model to learn a new concept from a set of images. However, existing personalization approaches usually require heavy test-time finetuning for each concept, which is time-consuming and difficult to scale. We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without any test-time finetuning. We achieve this with several major components. First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder. Second, to keep the fine details of the identity, we learn rich visual feature representations by introducing a few adapter layers into the pre-trained model. We train these components only on text-image pairs, without using paired images of the same concept. Compared to test-time finetuning-based methods like DreamBooth and Textual-Inversion, our model generates competitive results on unseen concepts in terms of language-image alignment, image fidelity, and identity preservation, while being 100 times faster.
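The two components described above — an image encoder that maps the input images to a single textual token, and lightweight adapter layers injected into the frozen backbone — can be sketched as follows. This is a minimal, illustrative NumPy mock-up, not the paper's actual architecture: the encoder weights, adapter shapes, placeholder-token position, and embedding dimension are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative, not the real model's)

# Hypothetical learnable image encoder: pools features of the input
# images into one textual-token embedding representing the new concept.
W_enc = rng.standard_normal((d, d)) * 0.1

def encode_concept(image_feats):
    # image_feats: (n_images, d) -> one concept-token embedding (d,)
    pooled = image_feats.mean(axis=0)
    return pooled @ W_enc

# Hypothetical adapter layer: a small residual bottleneck projection
# added to the frozen backbone to carry fine-grained identity details.
W_down = rng.standard_normal((d, 2)) * 0.1
W_up = rng.standard_normal((2, d)) * 0.1

def adapter(h):
    # Residual form: with small weights, backbone behavior is preserved.
    return h + (h @ W_down) @ W_up

# Prompt embeddings for e.g. "a photo of <V> person"; slot 3 is the
# placeholder token <V> that receives the learned concept embedding.
prompt_emb = rng.standard_normal((5, d))
image_feats = rng.standard_normal((4, d))   # features of the input images
prompt_emb[3] = encode_concept(image_feats)
conditioned = adapter(prompt_emb)            # adapter-enhanced conditioning
print(conditioned.shape)
```

Because only the encoder and adapters are trained (on generic text-image pairs), personalizing to a new concept at test time is a single forward pass — no per-concept finetuning.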