Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as `Chortai (dog)' or `Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with knowledge of the high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both the text prompt and the retrievals. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text and retrieval conditions to balance text alignment and retrieval alignment. Re-Imagen achieves new SoTA FID results on two image generation benchmarks, COCO (FID = 5.25) and WikiImage (FID = 5.82), without fine-tuning. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple visual domains. Human evaluation on EntityDrawBench shows that Re-Imagen performs on par with the best prior models in photo-realism, but with significantly better faithfulness, especially on less frequent entities.
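As a rough illustration of the interleaved classifier-free guidance idea, the sketch below shows one way to alternate between a text-guided and a retrieval-guided noise update during sampling; it is a hedged reading of the abstract, not the paper's exact formulation, and the symbols $c_p$ (text prompt), $c_r$ (retrieved neighbors), and the guidance weights $w_p$, $w_r$ are illustrative notation introduced here.

% Sketch of interleaved classifier-free guidance (illustrative only).
% \epsilon_\theta is the denoising network of an epsilon-prediction diffusion
% model; c_p is the text prompt, c_r the retrieved (image, text) neighbors,
% and w_p, w_r the text and retrieval guidance weights (assumed notation).
\begin{align*}
\hat{\epsilon}^{\,p}_t &= \epsilon_\theta(x_t, \emptyset, c_r)
  + w_p \bigl(\epsilon_\theta(x_t, c_p, c_r) - \epsilon_\theta(x_t, \emptyset, c_r)\bigr)
  && \text{(text-guided step)} \\
\hat{\epsilon}^{\,r}_t &= \epsilon_\theta(x_t, c_p, \emptyset)
  + w_r \bigl(\epsilon_\theta(x_t, c_p, c_r) - \epsilon_\theta(x_t, c_p, \emptyset)\bigr)
  && \text{(retrieval-guided step)}
\end{align*}

Under this sketch, denoising steps would alternate between the two updates according to some fixed ratio, trading off alignment with the prompt against fidelity to the retrieved references.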