Novel architectures have recently improved generative image synthesis, achieving excellent visual quality across a variety of tasks. Much of this success is due to the scalability of these architectures, and hence stems from a dramatic increase in model complexity and in the computational resources invested in training these models. Our work questions the underlying paradigm of compressing large training data into ever-growing parametric representations. Instead, we present an orthogonal, semi-parametric approach: we complement comparatively small diffusion or autoregressive models with a separate image database and a retrieval strategy. During training, we retrieve a set of nearest neighbors from this external database for each training instance and condition the generative model on these informative samples. While the retrieval approach provides the (local) content, the model focuses on learning the composition of scenes based on this content. As our experiments demonstrate, simply swapping the database for one with different contents transfers a trained model post hoc to a novel domain. The evaluation shows competitive performance on tasks which the generative model has not been trained on, such as class-conditional synthesis, zero-shot stylization, or text-to-image synthesis without requiring paired text-image data. With negligible memory and computational overhead for the external database and retrieval, we can significantly reduce the parameter count of the generative model and still outperform the state of the art.
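The retrieval step described above can be illustrated with a minimal sketch. The function below performs k-nearest-neighbor lookup by cosine similarity over precomputed embeddings; the choice of embedding model, database, and how the retrieved samples are fed to the generative model (e.g. via cross-attention) are assumptions outside this sketch, not details specified by the abstract.

```python
import numpy as np

def retrieve_neighbors(query_emb, db_embs, k=4):
    """Return indices of the k database entries most similar to the query.

    Uses cosine similarity over L2-normalized embeddings. The embedding
    encoder and the external image database are assumed to exist already.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to each entry
    return np.argsort(-sims)[:k]       # indices of the top-k neighbors

# Toy example: a 6-entry "database" of 3-d embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(6, 3))
query = db[2] + 0.01 * rng.normal(size=3)  # a query close to entry 2
neighbors = retrieve_neighbors(query, db, k=2)
# The retrieved neighbors would then serve as conditioning inputs
# for the (comparatively small) diffusion or autoregressive model.
```

During training, each training instance plays the role of `query`; at test time, swapping `db` for a database with different contents is what transfers the trained model to a new domain.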