Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, for images with complex scenes, properly describing both global image structure and object details remains challenging. In this paper, we present Frido, a Feature Pyramid Diffusion model that performs a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector-quantized features, followed by coarse-to-fine gating to produce the image output. During this multi-scale representation learning stage, additional input conditions such as text, scene graphs, or image layouts can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments on various unconditional and conditional image generation tasks, including text-to-image, layout-to-image, scene-graph-to-image, and label-to-image synthesis. More specifically, we achieve state-of-the-art FID scores on five benchmarks: layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at https://github.com/davidhalladay/Frido.