Recent works demonstrate a remarkable ability to customize text-to-image diffusion models given only a few example images. What happens if you try to customize such models using multiple, fine-grained concepts in a sequential (i.e., continual) manner? In our work, we show that recent state-of-the-art customization methods for text-to-image models suffer from catastrophic forgetting when new concepts arrive sequentially. Specifically, when a new concept is added, the ability to generate high-quality images of past, similar concepts degrades. To circumvent this forgetting, we propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in the cross-attention layers of the popular Stable Diffusion model. Furthermore, we use customization prompts that do not include the word for the customized object (i.e., "person" for a human face dataset) and are initialized as completely random embeddings. Importantly, our method induces only marginal additional parameter costs and requires no storage of user data for replay. We show that C-LoRA not only outperforms several baselines in our proposed setting of text-to-image continual customization, which we refer to as Continual Diffusion, but also achieves a new state of the art in the well-established rehearsal-free continual learning setting for image classification. The strong performance of C-LoRA in two separate domains positions it as a compelling solution for a wide range of applications, and we believe it has significant potential for practical impact.
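Though the abstract stays high-level, the core mechanism can be illustrated concretely. Below is a minimal, hypothetical PyTorch sketch of self-regularized low-rank adaptation on a frozen linear projection (such as a cross-attention key or value projection in Stable Diffusion): each new concept trains fresh low-rank factors, and a penalty discourages new updates from overwriting weight locations already modified by past concepts. The class name `CLoRALinear`, the exact form of the regularizer, and the consolidation step are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class CLoRALinear(nn.Module):
    """Sketch of a self-regularized LoRA layer over a frozen linear projection.

    Hypothetical illustration only: the precise C-LoRA loss and initialization
    may differ. The idea shown is to penalize new low-rank updates where the
    accumulated updates of past concepts are already large.
    """

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weight stays frozen
        out_f, in_f = base.weight.shape
        # Trainable low-rank factors for the *current* concept.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        # Accumulated (frozen) low-rank deltas from past concepts.
        self.register_buffer("past_delta", torch.zeros(out_f, in_f))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.past_delta + self.B @ self.A
        return self.base(x) + x @ delta.T

    def self_regularization(self) -> torch.Tensor:
        # Penalize new updates in weight locations already edited by past
        # concepts (one plausible form of a self-regularization loss).
        return (self.past_delta * (self.B @ self.A)).pow(2).sum()

    @torch.no_grad()
    def consolidate(self) -> None:
        # After a concept is learned, fold its update into the frozen past
        # and re-initialize the trainable factors for the next concept.
        self.past_delta += self.B @ self.A
        nn.init.normal_(self.A, std=0.01)
        self.B.zero_()


# Example: wrap a stand-in for a cross-attention key projection.
to_k = nn.Linear(768, 320)
lora_to_k = CLoRALinear(to_k, rank=4)
out = lora_to_k(torch.randn(2, 77, 768))          # (batch, tokens, out_f)
loss = out.sum() + 0.1 * lora_to_k.self_regularization()
loss.backward()                                    # grads flow only to A, B
lora_to_k.consolidate()                            # freeze this concept's update
```

In this sketch, replay-free continual learning falls out of the structure: only the small factors `A` and `B` are trained per concept, and the element-wise penalty (rather than stored user images) is what protects previously learned concepts.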