Recent text-to-image models have achieved impressive results. However, since they require large-scale datasets of text-image pairs, it is impractical to train them on new domains where data is scarce or unlabeled. In this work, we propose using large-scale retrieval methods, in particular efficient k-Nearest-Neighbors (kNN) search, which offers novel capabilities: (1) training a substantially smaller and more efficient text-to-image diffusion model without any text, (2) generating out-of-distribution images by simply swapping the retrieval database at inference time, and (3) performing text-driven local semantic manipulations while preserving object identity. To demonstrate the robustness of our method, we apply our kNN approach to two state-of-the-art diffusion backbones and show results on several different datasets. As evaluated by human studies and automatic metrics, our method achieves state-of-the-art results compared to existing approaches that train text-to-image generation models using images only (without paired text data).
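To make capability (2) concrete, the following is a minimal sketch of kNN retrieval over an embedding database, where swapping the database changes what the generator is conditioned on without retraining. All names here are illustrative; the paper's actual system retrieves image embeddings in a shared text-image space and feeds them to a diffusion model, which this sketch does not implement.

```python
import numpy as np

def knn_retrieve(query, database, k=3):
    """Return indices and rows of the k database embeddings most
    similar to the query under cosine similarity."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q
    idx = np.argsort(-sims)[:k]  # top-k most similar entries
    return idx, database[idx]

# Hypothetical embedding databases; in practice these would be image
# embeddings from a shared text-image space (e.g. a CLIP-like encoder).
rng = np.random.default_rng(0)
domain_a = rng.normal(size=(100, 16))  # original retrieval database
domain_b = rng.normal(size=(100, 16))  # swapped-in database, new domain

query = rng.normal(size=16)  # stand-in for an encoded text prompt
idx_a, neighbors_a = knn_retrieve(query, domain_a)
# Swapping the database re-targets the conditioning signal at
# inference time; the model itself is untouched.
idx_b, neighbors_b = knn_retrieve(query, domain_b)
```

In a real system the brute-force search above would be replaced by an approximate-nearest-neighbor index for efficiency at scale; the swap-the-database idea is unchanged.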