Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically aiming for more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, building on the controllability offered by manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meaning associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in both qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and to justify the properties of cross-attention layers in the generation process.
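The idea of manipulating keys and values in cross-attention can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not say how the linguistic structure is injected, so the function `structured_cross_attention`, the choice to average the outputs over several value sequences (for example, the full prompt plus one re-encoded noun phrase), and all tensor shapes are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Standard scaled dot-product cross-attention.

    queries: (n_pixels, d)   image features
    keys:    (n_tokens, d)   text features (decide *where* tokens attend)
    values:  (n_tokens, d_v) text features (decide *what* is written)
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1 over the tokens
    return attn @ values, attn

def structured_cross_attention(queries, keys, value_sets):
    """Hypothetical variant (an assumption, not the paper's stated method):
    compute one attention map from the keys, apply it to several value
    sequences, and average the results."""
    _, attn = cross_attention(queries, keys, value_sets[0])
    outputs = [attn @ v for v in value_sets]
    return np.mean(outputs, axis=0)

# Toy data: 16 image positions, 5 text tokens, feature size 8.
rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 8))
keys = rng.normal(size=(5, 8))
values_full = rng.normal(size=(5, 8))    # stands in for the full-prompt encoding
values_phrase = values_full.copy()
values_phrase[1:3] = rng.normal(size=(2, 8))  # stands in for a re-encoded noun phrase

plain_out, attn = cross_attention(queries, keys, values_full)
structured_out = structured_cross_attention(queries, keys, [values_full, values_phrase])
single_out = structured_cross_attention(queries, keys, [values_full])
```

In this sketch the attention map (layout) is shared, while the content written into the image features is blended across value sequences. With a single value sequence it reduces to ordinary cross-attention.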