Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as "Chortai (dog)" or "Picarones (food)". To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with knowledge of both the high-level semantics and the low-level visual details of the mentioned entities, and thus renders their visual appearance more accurately. We train Re-Imagen on a constructed dataset of (image, text, retrieval) triples to teach the model to ground its output on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text and retrieval conditions to balance text alignment against retrieval alignment. Re-Imagen achieves significant gains in FID score on COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.
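The interleaved sampling strategy mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed alternation schedule (`text_steps`, `period`), and the guidance weights are all hypothetical, and the abstract does not specify the exact form of the guidance. The sketch assumes the standard classifier-free guidance form, base + w * (conditioned - base), and assumes that each step strengthens one condition by contrasting the fully conditioned noise prediction with a prediction in which that condition is dropped.

```python
from typing import List, Sequence


def guide(full: Sequence[float], base: Sequence[float], weight: float) -> List[float]:
    """Standard classifier-free guidance: base + weight * (full - base)."""
    return [b + weight * (f - b) for f, b in zip(full, base)]


def interleaved_guidance(
    eps_full: Sequence[float],          # prediction given text AND retrieval
    eps_no_text: Sequence[float],       # prediction with the text condition dropped
    eps_no_retrieval: Sequence[float],  # prediction with the retrieval condition dropped
    step: int,
    w_text: float,
    w_retrieval: float,
    text_steps: int = 1,                # hypothetical schedule: text_steps out of
    period: int = 2,                    # every `period` steps emphasize the text
) -> List[float]:
    """Pick which condition to emphasize at this sampling step."""
    if step % period < text_steps:
        # Text-emphasizing step: push away from the prediction that ignores the text.
        return guide(eps_full, eps_no_text, w_text)
    # Retrieval-emphasizing step: push away from the prediction that ignores the retrieval.
    return guide(eps_full, eps_no_retrieval, w_retrieval)


# Toy two-dimensional "noise predictions", chosen only for illustration.
full, no_text, no_retrieval = [1.0, 2.0], [0.0, 0.0], [1.0, 1.0]
text_step = interleaved_guidance(full, no_text, no_retrieval, step=0, w_text=2.0, w_retrieval=3.0)
retrieval_step = interleaved_guidance(full, no_text, no_retrieval, step=1, w_text=2.0, w_retrieval=3.0)
```

In this sketch, changing the ratio of `text_steps` to `period` shifts the balance between following the prompt and following the retrieved references. A randomized choice per step would serve the same purpose.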