Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Research on text-to-image generation has made significant progress in producing diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Although state-of-the-art models can generate high-quality images of common entities, they often struggle with uncommon entities such as 'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references when generating the image. This retrieval step gives Re-Imagen knowledge of both the high-level semantics and the low-level visual details of the mentioned entities, and thus improves its accuracy in rendering their visual appearance. We train Re-Imagen on a constructed dataset of (image, text, retrieval) triples to teach the model to ground its output on both the text prompt and the retrieval. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance for the text condition and the retrieval condition, balancing alignment with the text against alignment with the retrieval. Without fine-tuning, Re-Imagen achieves new state-of-the-art FID results on two image generation benchmarks: COCO (FID = 5.25) and WikiImage (FID = 5.82). To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple visual domains. Human evaluation on EntityDrawBench shows that Re-Imagen performs on par with the best prior models in photo-realism, but with significantly better faithfulness, especially on less frequent entities.
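The interleaved guidance idea can be illustrated with a minimal sketch. The abstract does not give the schedule or the formulas, so everything below is an assumption made for illustration: the function names, the ratio parameter `eta`, the random per-step choice between the two guidance modes, and the toy `toy_denoiser` standing in for a real diffusion network. Each step applies the standard classifier-free guidance extrapolation, using as baseline either the prediction with the text dropped (text guidance) or the prediction with the retrieval dropped (retrieval guidance).

```python
import numpy as np

def guided_eps(eps_cond, eps_base, weight):
    """Standard classifier-free guidance: extrapolate from a baseline
    prediction toward the fully conditioned prediction."""
    return weight * eps_cond - (weight - 1.0) * eps_base

def interleaved_guidance(denoiser, x, t, text, retrieval,
                         w_text, w_ret, eta, rng):
    """One sampling step of (assumed) interleaved guidance.

    With probability eta the step strengthens the text condition,
    otherwise it strengthens the retrieval condition.
    """
    eps_full = denoiser(x, t, text, retrieval)
    if rng.random() < eta:
        # Baseline drops the text, so guidance pushes toward the prompt.
        eps_base = denoiser(x, t, None, retrieval)
        return guided_eps(eps_full, eps_base, w_text)
    # Baseline drops the retrieval, so guidance pushes toward the references.
    eps_base = denoiser(x, t, text, None)
    return guided_eps(eps_full, eps_base, w_ret)

def toy_denoiser(x, t, text, retrieval):
    """Hypothetical stand-in for a noise-prediction network.
    Conditions are plain floats; None means the condition is dropped."""
    text_term = 0.0 if text is None else text
    ret_term = 0.0 if retrieval is None else retrieval
    return x + text_term + 2.0 * ret_term

rng = np.random.default_rng(0)
x = np.zeros(2)
eps = interleaved_guidance(toy_denoiser, x, t=0, text=1.0, retrieval=1.0,
                           w_text=3.0, w_ret=4.0, eta=0.5, rng=rng)
print(eps)
```

In this sketch, `eta` controls how often a step favors the text over the retrieval, which is one plausible way to trade off the two kinds of alignment that the abstract mentions.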

