Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural means of flexible yet fine-grained control over the result. One emerging area of research is the rapid adaptation of large text-to-image models to smaller datasets or new visual concepts. However, the most efficient method of adaptation, called textual inversion, suffers from long training times, which both restricts practical applications and slows down research experiments. In this work, we study the training dynamics of textual inversion, aiming to speed it up. We observe that most concepts are learned at early stages and do not improve later, but standard model convergence metrics fail to indicate this. To address it, we propose a simple early stopping criterion that only requires evaluating the textual inversion loss on a fixed set of inputs at every training iteration. Our experiments on both Latent Diffusion and Stable Diffusion models for 93 concepts demonstrate the competitive performance of our method, speeding up adaptation by up to 15 times with no significant drops in quality.
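To make the criterion concrete, below is a minimal PyTorch sketch of one way such a stopping rule could work. A toy model stands in for the frozen diffusion backbone, training steps use freshly sampled noise (as in standard textual inversion), while the stopping signal is the loss re-evaluated on one fixed input. The plateau test on a sliding window, and its `window` and `threshold` values, are illustrative assumptions, not the paper's exact procedure or tuned constants.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for the frozen model: the only trainable parameter is
# the new token embedding, mirroring the textual inversion setup.
W = torch.randn(16, 16)                  # frozen "model" weights
target = torch.randn(16)                 # concept to be recovered
embedding = torch.zeros(16, requires_grad=True)
opt = torch.optim.Adam([embedding], lr=1e-2)

def denoise_loss(emb, noise):
    # Stand-in for the denoising objective; in the real setting the
    # inputs would be latents, Gaussian noise, and timesteps passed
    # through the frozen U-Net together with the learned embedding.
    return F.mse_loss(W @ emb + noise, W @ target)

# Fix ONE evaluation input before training and reuse it at every
# iteration, so the evaluation curve is smooth and comparable
# across steps instead of being dominated by sampling noise.
fixed_noise = torch.randn(16) * 0.1

def should_stop(history, window=50, threshold=0.1):
    # Plateau test: stop once the recent segment of the fixed-input
    # loss curve is nearly flat relative to its overall variation.
    if len(history) < 2 * window:
        return False
    h = torch.tensor(history)
    return (h[-window:].var() / (h.var() + 1e-12)).item() < threshold

history = []
for step in range(5000):
    opt.zero_grad()
    # Stochastic training loss: fresh noise each step, as usual.
    loss = denoise_loss(embedding, torch.randn(16) * 0.1)
    loss.backward()
    opt.step()
    # Deterministic monitoring loss: always the same fixed input.
    with torch.no_grad():
        history.append(denoise_loss(embedding, fixed_noise).item())
    if should_stop(history):
        print(f"stopping early at step {step}")
        break
```

The key design point is the separation of the two losses: the training loss stays stochastic, while the monitoring loss is computed on frozen inputs, so its flattening reflects convergence of the embedding rather than noise in the objective.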