Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large number of image-annotation pairs for training and is thus dependent on their availability, correctness, and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on imbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible, and expected to profit from deployment at scale.
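As a hedged sketch of the mechanism described above (the symbols for the self-annotation k, the guidance scale w, the feature extractor f, and the self-annotation function \xi are assumed notation, not taken from this abstract): self-labeled guidance can be read as classifier-free guidance in which the conditioning signal is produced by self-supervision rather than by human labels,

\[
\hat{\epsilon}_\theta(x_t, k) \;=\; (1 + w)\,\epsilon_\theta(x_t, k) \;-\; w\,\epsilon_\theta(x_t),
\qquad
k \;=\; \xi\!\left(f(x);\, \mathcal{D}\right),
\]

so that at sampling time the denoiser is pushed toward images consistent with the self-derived annotation k (an image-level pseudo-label, a box, or a mask proposal) and away from the unconditional prediction.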