Following generative adversarial networks (GANs), a de facto standard model for image generation, denoising diffusion models (DDMs) have been actively researched and have attracted strong attention for their ability to generate images with high quality and diversity. However, how the self-attention mechanism inside the UNet of DDMs works remains under-explored. To shed light on it, in this paper we first investigate the self-attention operations within these black-box diffusion models and build hypotheses about them. Next, we verify the hypotheses about the self-attention map by conducting frequency analysis and testing its relationship with the generated objects. As a result, we find that the attention map is closely related to the quality of the generated images. Meanwhile, diffusion guidance methods based on additional information such as labels have been proposed to improve the quality of generated images. Inspired by these methods, we present a label-free guidance method based on the intermediate self-attention map that can guide existing pretrained diffusion models to generate images with higher fidelity. In addition to enhancing sample quality when used alone, our method further improves results when combined with classifier guidance on ImageNet 128x128.