Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as 'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with knowledge of the high-level semantics and low-level visual details of the mentioned entities, which improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text and retrieval conditions to balance text alignment and retrieval alignment. Re-Imagen achieves significant gains in FID score on COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.
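The retrieval step and the interleaved guidance described above can be illustrated with a minimal sketch. This is one plausible reading, not the authors' implementation: the abstract gives no formulas, so the cosine-similarity retriever, the per-step random choice between the two guidance modes, and every name here (`retrieve_top_k`, `guided_eps`, `interleaved_guidance`, `w_text`, `w_retr`, `eta`) are assumptions made for illustration. The noise predictions are plain lists of floats standing in for a diffusion model's outputs.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, kb_vecs, k=2):
    """Return indices of the k knowledge-base entries most similar to the query.

    Assumption: entries are ranked by cosine similarity of embeddings.
    """
    ranked = sorted(
        range(len(kb_vecs)),
        key=lambda i: cosine(query_vec, kb_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def guided_eps(eps_full, eps_dropped, weight):
    """Standard classifier-free guidance combination.

    eps_full:    prediction conditioned on both text and retrieval
    eps_dropped: prediction with one of the two conditions dropped
    """
    return [weight * f - (weight - 1.0) * d for f, d in zip(eps_full, eps_dropped)]


def interleaved_guidance(eps_full, eps_no_text, eps_no_retrieval,
                         w_text, w_retr, eta, rng):
    """Choose one guidance mode per sampling step (hypothetical schedule).

    With probability eta the step strengthens the text condition,
    otherwise it strengthens the retrieval condition.
    """
    if rng.random() < eta:
        return guided_eps(eps_full, eps_no_text, w_text), "text"
    return guided_eps(eps_full, eps_no_retrieval, w_retr), "retrieval"
```

Calling `interleaved_guidance` once per denoising step with a seeded `random.Random` gives a reproducible mix of the two modes. In this sketch, `eta` is the knob that trades text alignment against retrieval alignment.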

