Guidance techniques are commonly used in diffusion and flow models to improve image quality and input consistency for conditional generative tasks such as class-conditional and text-to-image generation. In particular, classifier-free guidance (CFG) is the most widely adopted guidance technique. However, it entails trade-offs between quality, diversity and consistency, improving some at the expense of others. While recent work has shown that these factors can be disentangled to some extent, such methods incur overhead, either requiring an additional (weaker) model or more forward passes per sampling step. In this paper, we propose Entropy Rectifying Guidance (ERG), a simple and effective guidance method based on inference-time changes to the attention mechanism of state-of-the-art diffusion transformer architectures, which allows for simultaneous improvements in image quality, diversity and prompt consistency. ERG is more general than CFG and similar guidance techniques, as it also extends to unconditional sampling. We show that ERG yields significant improvements across various tasks, including text-to-image, class-conditional and unconditional image generation. We also show that ERG can be seamlessly combined with other recent guidance methods such as CADS and APG, further improving generation results.
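For reference, the sketch below summarizes the standard CFG update that the abstract contrasts with ERG; it is a minimal illustration of the widely used formulation, not the paper's method, and the names (`model`, `x_t`, `cond`, `guidance_scale`) are illustrative placeholders.

```python
# Minimal sketch of standard classifier-free guidance (CFG).
# Assumes a denoiser `model(x_t, t, cond)` returning a noise prediction;
# these names are placeholders, not an API from the paper.
import torch

@torch.no_grad()
def cfg_denoise(model, x_t, t, cond, guidance_scale: float = 5.0):
    # CFG needs two forward passes per sampling step:
    # one conditional and one unconditional.
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    # Extrapolate away from the unconditional prediction; larger scales
    # typically trade diversity for prompt consistency.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```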