Text-guided diffusion models such as DALLE-2, IMAGEN, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content, and in many cases the resulting images are of very high quality. However, these models often struggle to compose scenes containing several key objects, such as characters, in specified positional relationships. Unfortunately, this capability to ``direct'' the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work we take a particularly straightforward approach to providing the needed direction: we inject ``activation'' at desired positions in the cross-attention maps corresponding to the objects under control, while attenuating the remainder of the map. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. To the best of our knowledge, our Directed Diffusion method is the first diffusion technique that provides positional control over multiple objects while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines of code to implement.
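The core idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the constant boost/attenuation factors, and the rectangular region are illustrative assumptions. It shows the essential operation of strengthening one prompt token's cross-attention weights inside a target region, weakening them elsewhere, and renormalizing:

```python
import numpy as np

def direct_attention(attn, token_idx, region, boost=5.0, attenuate=0.1):
    """Hypothetical sketch of cross-attention editing for positional control.

    attn:      (H, W, T) cross-attention map, softmax-normalized over the
               T prompt tokens at each spatial location.
    token_idx: index of the prompt token for the object under control.
    region:    (top, bottom, left, right) bounding box where the object
               should appear.
    """
    edited = attn.copy()
    mask = np.zeros(attn.shape[:2], dtype=bool)
    top, bottom, left, right = region
    mask[top:bottom, left:right] = True
    # Inject activation for the controlled token inside the target region,
    # and attenuate it everywhere else in the map.
    edited[..., token_idx] = np.where(
        mask,
        edited[..., token_idx] * boost,
        edited[..., token_idx] * attenuate,
    )
    # Renormalize so the weights at each location still sum to 1 over tokens.
    edited /= edited.sum(axis=-1, keepdims=True)
    return edited
```

In an actual diffusion pipeline this edit would be applied to the cross-attention layers of the denoising U-Net during (some of) the sampling steps, so that the model's own generation process fills in a coherent object at the directed position.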