Recent diffusion-based generators can produce high-quality images based only on textual prompts. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that can achieve robust layout control without requiring training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the reconstruction in the desired direction given, e.g., a user-specified layout. In order to determine how to best guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively with several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.
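The abstract names backward guidance but does not define it, so the following is only a minimal sketch of one plausible reading: score how much of a token's cross-attention mass falls inside a user-specified box, then take gradient steps that lower that score. Everything here is an assumption made for illustration, including the energy form `(1 - mass_inside)^2`, the function names, and the step size. A grid of logits stands in for the diffusion latent; in a real generator the gradient would instead be backpropagated through the denoiser to the noisy latent.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def layout_energy(logits, mask):
    """Hypothetical energy: squared fraction of attention outside the box."""
    attn = softmax(logits.ravel())
    inside = attn[mask.ravel()].sum()
    return (1.0 - inside) ** 2


def energy_grad(logits, mask):
    """Analytic gradient of layout_energy with respect to the logits."""
    attn = softmax(logits.ravel())
    m = mask.ravel().astype(float)
    inside = (attn * m).sum()
    # d(inside)/d(logit_j) = attn_j * (m_j - inside) for a softmax.
    grad = -2.0 * (1.0 - inside) * attn * (m - inside)
    return grad.reshape(logits.shape)


def guide(logits, mask, step=5.0, n_steps=50):
    """Plain gradient descent on the energy (toy stand-in for latent updates)."""
    z = logits.copy()
    for _ in range(n_steps):
        z -= step * energy_grad(z, mask)
    return z


# Toy 8x8 "attention logits" for one text token and a target bounding box.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[1:4, 2:6] = True

guided = guide(logits, mask)
energy_before = layout_energy(logits, mask)
energy_after = layout_energy(guided, mask)
```

Each step raises the logits inside the box and lowers those outside it, so the attention mass inside the box grows and `energy_after` ends up below `energy_before`. The sketch covers only the guidance signal, not sampling or the attention layers themselves.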