We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark and achieving a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we utilize them to perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) Having oracle access to the image scene (Scene&Fuse) allows us to achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, easy-to-adapt, simple, and general approach for scenarios in which the model may benefit from additional visual information.