Generative image synthesis with diffusion models has recently achieved excellent visual quality in several tasks such as text-based or class-conditional image synthesis. Much of this success is due to a dramatic increase in the computational capacity invested in training these models. This work presents an alternative approach: inspired by its successful application in natural language processing, we propose to complement the diffusion model with a retrieval-based approach and to introduce an explicit memory in the form of an external database. During training, our diffusion model is conditioned on similar visual features retrieved via CLIP from the neighborhood of each training instance in this database. By leveraging CLIP's joint image-text embedding space, our model achieves highly competitive performance on tasks for which it has not been explicitly trained, such as class-conditional or text-to-image synthesis, and can be conditioned on both text and image embeddings. Moreover, we can apply our approach to unconditional generation, where it achieves state-of-the-art performance. Our approach incurs low computational and memory overheads and is easy to implement. We discuss its relationship to concurrent work and will publish code and pretrained models soon.
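To make the retrieval-augmented training step concrete, the snippet below gives a minimal sketch, not the authors' implementation. It assumes a precomputed database of CLIP image embeddings (`db_embeddings`), a hypothetical CLIP image encoder `clip_image_encoder`, and a hypothetical denoising network `model` that accepts the retrieved neighbor embeddings through a `context` argument (e.g. via cross-attention); the cosine noise schedule is a toy stand-in, not the schedule used in the paper.

```python
import torch
import torch.nn.functional as F

def retrieve_neighbors(query_emb, db_embeddings, k=4):
    """Return the k nearest database embeddings (cosine similarity) for each query."""
    query = F.normalize(query_emb, dim=-1)              # (B, D)
    db = F.normalize(db_embeddings, dim=-1)              # (N, D)
    sims = query @ db.t()                                # (B, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices                  # (B, k) indices of nearest neighbors
    return db_embeddings[topk]                           # (B, k, D) retrieved conditioning set

def training_step(model, x0, clip_image_encoder, db_embeddings, k=4):
    """One hypothetical training step: denoise while attending to retrieved neighbors."""
    with torch.no_grad():
        query_emb = clip_image_encoder(x0)               # CLIP image embedding of the training image
    neighbors = retrieve_neighbors(query_emb, db_embeddings, k)  # (B, k, D)

    t = torch.randint(0, 1000, (x0.shape[0],), device=x0.device)     # random diffusion timestep
    noise = torch.randn_like(x0)
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2      # toy cosine noise schedule
    x_t = (alpha_bar.sqrt().view(-1, 1, 1, 1) * x0
           + (1 - alpha_bar).sqrt().view(-1, 1, 1, 1) * noise)

    pred = model(x_t, t, context=neighbors)              # denoiser conditioned on neighbor embeddings
    return F.mse_loss(pred, noise)                       # standard epsilon-prediction objective
```

Because CLIP embeds images and text in a shared space, the same `context` pathway can in principle be fed text embeddings or a different retrieval database at inference time, which is how the conditioning described in the abstract can be swapped without retraining.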