Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.
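The "gated mechanism" mentioned in the abstract can be made concrete with a small sketch: each frozen transformer block of the pre-trained diffusion model gains a new trainable gated self-attention layer that attends over the concatenation of visual tokens and grounding tokens (e.g. encoded bounding boxes plus phrase embeddings), and its output is scaled by the tanh of a zero-initialized learnable scalar so the pre-trained model's behavior is untouched at the start of training. The class and parameter names below (`GatedSelfAttention`, `dim`, `num_heads`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Minimal sketch of a GLIGEN-style gated self-attention layer.

    Visual tokens from a frozen transformer block are concatenated with
    grounding tokens, passed through a new trainable self-attention layer,
    and added back through a zero-initialized, tanh-gated scalar so the
    pre-trained model is preserved at initialization.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        # Learnable gate initialized to zero: tanh(0) = 0, so the new branch
        # contributes nothing until training gradually opens the gate.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual:    (B, N_v, dim) tokens from the frozen block
        # grounding: (B, N_g, dim) grounding tokens (boxes + phrases)
        n_v = visual.shape[1]
        x = self.norm(torch.cat([visual, grounding], dim=1))
        attn_out, _ = self.attn(x, x, x)
        # Keep only the outputs at visual-token positions, gate, add residual.
        return visual + torch.tanh(self.gate) * self.proj(attn_out[:, :n_v])
```

Because only these new layers (and the gate) are trained while all original weights stay frozen, the model can absorb grounding inputs without overwriting the concept knowledge of the pre-trained text-to-image backbone.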