Novel architectures have recently improved generative image synthesis, leading to excellent visual quality in various tasks. Of particular note is the field of ``AI-Art'', which has seen unprecedented growth with the emergence of powerful multimodal models such as CLIP. By combining text and image synthesis models, so-called ``prompt-engineering'' has become established, in which carefully selected and composed sentences are used to achieve a particular visual style in the synthesized image. In this note, we present an alternative approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set of nearest neighbors is retrieved from an external database for each training instance, and the diffusion model is conditioned on these informative samples. During inference (sampling), we replace the retrieval database with a more specialized database that contains, for example, only images of a particular visual style. This provides a novel way to ``prompt'' a generally trained model after training and thereby specify a particular visual style. As our experiments show, this approach is superior to specifying the visual style within the text prompt. We open-source code and model weights at https://github.com/CompVis/latent-diffusion.
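The following is a minimal sketch of the retrieval-augmented conditioning idea described above, not the released implementation. It assumes a hypothetical `clip_encode` function that maps images to normalized CLIP embeddings and a hypothetical `diffusion_model` object whose `denoising_loss` and `sample` methods accept a set of neighbor embeddings as cross-attention context; the nearest-neighbor search is done with plain NumPy for illustration.

```python
# Sketch of retrieval-augmented diffusion conditioning (hypothetical APIs marked below).
import numpy as np

def build_index(database_images, clip_encode):
    """Precompute CLIP embeddings for an external (or style-specific) image database."""
    return np.stack([clip_encode(img) for img in database_images])  # shape (N, d)

def retrieve_neighbors(query_embedding, index, k=4):
    """Return the k nearest neighbors by cosine similarity (embeddings assumed normalized)."""
    sims = index @ query_embedding           # (N,)
    top = np.argsort(-sims)[:k]
    return index[top]                        # (k, d) conditioning set

def training_step(image, train_index, clip_encode, diffusion_model):
    """Condition the denoiser on the neighbors of each training instance."""
    z = clip_encode(image)
    neighbors = retrieve_neighbors(z, train_index, k=4)
    return diffusion_model.denoising_loss(image, context=neighbors)  # hypothetical API

def stylized_sample(query_embedding, style_index, diffusion_model):
    """At inference, swap in a specialized database (e.g., only watercolor paintings)
    to impose that visual style without changing the trained weights."""
    neighbors = retrieve_neighbors(query_embedding, style_index, k=4)
    return diffusion_model.sample(context=neighbors)                  # hypothetical API
```

The key design point, as stated in the abstract, is that only the retrieval database changes between training and sampling; the trained model itself is untouched.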