The impressive generative ability of large-scale text-to-image (T2I) models demonstrates their strong capacity to learn complex structures and meaningful semantics. However, relying on text prompts alone cannot fully exploit the knowledge the model has learned, especially when flexible and accurate structural control is needed. In this paper, we aim to ``dig out'' the capabilities that T2I models have implicitly learned, and then explicitly use them to control generation at a finer granularity. Specifically, we propose to learn simple and lightweight T2I-Adapters that align the internal knowledge of T2I models with external control signals, while keeping the original large T2I models frozen. In this way, we can train different adapters for different conditions, achieving rich control and editing effects. Furthermore, the proposed T2I-Adapters have attractive practical properties, such as composability and generalization. Extensive experiments demonstrate that our T2I-Adapter achieves promising generation quality across a wide range of applications.
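To make the adapter idea concrete, the following is a minimal sketch, assuming PyTorch. The class name `TinyT2IAdapter`, its channel widths, and the single-channel sketch input are illustrative assumptions, not the paper's exact architecture: a small trainable network maps an external condition map to multi-scale features that could be added to the frozen T2I model's intermediate features, so only the adapter's parameters are optimized.

```python
import torch
import torch.nn as nn

class TinyT2IAdapter(nn.Module):
    """Illustrative adapter sketch: encode an external condition
    (e.g., a sketch or depth map) into multi-scale feature maps.
    Channel sizes here are assumptions for demonstration only."""

    def __init__(self, cond_channels=1, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Conv2d(cond_channels, channels[0], 3, padding=1)
        self.stages = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.stages.append(nn.Sequential(
                # stride-2 conv halves the resolution at each scale
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(c_out, c_out, 3, padding=1),
            ))

    def forward(self, cond):
        # Produce one feature map per scale; in the full method these
        # would be fused with the frozen T2I model's features.
        x = self.stem(cond)
        feats = [x]
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

# Usage sketch: only the adapter is trained; the large T2I model
# stays frozen and its features are combined with these outputs.
adapter = TinyT2IAdapter()
sketch = torch.randn(1, 1, 256, 256)  # hypothetical control signal
for f in adapter(sketch):
    print(f.shape)
```

Because the adapter is small and trained independently of the frozen backbone, one can train a separate adapter per condition type and, as the abstract notes, compose them at inference time.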