Text-conditional diffusion models generate high-quality, diverse images. However, text is often an ambiguous specification for a desired target image, creating the need for additional user-friendly controls for diffusion-based image generation. We focus on having precise control over image output for scenes with several objects. Users control image generation by defining a collage: a text prompt paired with an ordered sequence of layers, where each layer is an RGBA image and a corresponding text prompt. We introduce Collage Diffusion, a collage-conditional diffusion algorithm that allows users to control both the spatial arrangement and visual attributes of objects in the scene, and also enables users to edit individual components of generated images. To ensure that different parts of the input text correspond to the various locations specified in the input collage layers, Collage Diffusion modifies text-image cross-attention with the layers' alpha masks. To maintain characteristics of individual collage layers that are not specified in text, Collage Diffusion learns specialized text representations per layer. Collage input also enables layer-based controls that provide fine-grained control over the final output: users can control image harmonization on a layer-by-layer basis, and they can edit individual objects in generated images while keeping other objects fixed. Collage-conditional image generation requires harmonizing the input collage to make objects fit together; the key challenge involves minimizing changes in the positions and key visual attributes of objects in the input collage while allowing other attributes of the collage to change in the harmonization process. By leveraging the rich information present in layer input, Collage Diffusion generates globally harmonized images that maintain desired object locations and visual characteristics better than prior approaches.
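To make the collage input concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes that layers are ordered back to front, that RGBA values are floats in [0, 1], and that attention maps share the image resolution. The names `Layer`, `Collage`, `composite`, `visible_masks`, and `mask_attention` are hypothetical, and the multiplicative masking rule is only one simple way to tie prompt tokens to layer regions; the abstract does not specify the exact rule Collage Diffusion uses.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Layer:
    rgba: np.ndarray  # (H, W, 4) floats in [0, 1]; channel 3 is alpha
    prompt: str       # text describing this layer


@dataclass
class Collage:
    prompt: str          # text describing the whole scene
    layers: List[Layer]  # assumed here to be ordered back to front


def composite(collage: Collage) -> np.ndarray:
    """Flatten the layers back to front with the standard 'over' operator."""
    h, w, _ = collage.layers[0].rgba.shape
    out = np.zeros((h, w, 3))
    for layer in collage.layers:
        rgb, alpha = layer.rgba[..., :3], layer.rgba[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out
    return out


def visible_masks(collage: Collage) -> List[np.ndarray]:
    """Per-layer visibility: a layer's alpha, reduced where later layers cover it."""
    alphas = [layer.rgba[..., 3] for layer in collage.layers]
    masks = []
    for i, alpha in enumerate(alphas):
        vis = alpha.copy()
        for later in alphas[i + 1:]:
            vis = vis * (1.0 - later)
        masks.append(vis)
    return masks


def mask_attention(attn: np.ndarray, token_layer: List[int],
                   masks: List[np.ndarray]) -> np.ndarray:
    """Scale cross-attention so each layer's tokens attend only inside its mask.

    attn:        (H*W, T) attention probabilities, one row per pixel.
    token_layer: for each of the T tokens, the index of the layer it
                 describes, or -1 for tokens of the global prompt.
    This multiplicative rule is an illustrative assumption.
    """
    out = attn.copy()
    for t, layer_idx in enumerate(token_layer):
        if layer_idx >= 0:
            out[:, t] *= masks[layer_idx].reshape(-1)
    # Renormalize each pixel's attention over tokens.
    return out / np.clip(out.sum(axis=1, keepdims=True), 1e-8, None)
```

In this sketch, `composite` yields the unharmonized starting image, while `visible_masks` gives the regions that each layer's prompt tokens would be restricted to inside `mask_attention`. Tokens of the global prompt (index `-1`) are left unmasked so they can influence the whole scene.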