Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly, we find that these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.
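To make the link between an energy-based parameterization and a Metropolis-corrected sampler concrete, the following is a minimal sketch on a toy problem, not the implementation used in this work. It assumes two hand-written 1-D Gaussian energies in place of learned diffusion models, composes them as a product by summing their energies (equivalently, their scores), and samples with a Metropolis-adjusted Langevin step. The function names (`make_gaussian_energy`, `compose_product`, `mala`), the step size, and the chain counts are all illustrative assumptions, and the annealing over noise levels that a diffusion sampler would use is omitted.

```python
import numpy as np

def make_gaussian_energy(mu, sigma):
    """Toy stand-in for a learned model: energy and score of N(mu, sigma^2)."""
    def energy(x):
        return 0.5 * ((x - mu) / sigma) ** 2
    def score(x):
        # The score is the gradient of the log-density, i.e. -dE/dx.
        return -(x - mu) / sigma ** 2
    return energy, score

def compose_product(models):
    """Product composition: energies add, and therefore scores add too."""
    energies, scores = zip(*models)
    def energy(x):
        return sum(e(x) for e in energies)
    def score(x):
        return sum(s(x) for s in scores)
    return energy, score

def mala(energy, score, x0, step, n_steps, rng):
    """Metropolis-adjusted Langevin on independent 1-D chains (one per entry of x0)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        prop = x + step * score(x) + np.sqrt(2.0 * step) * noise
        # Log-densities of the forward and reverse Gaussian proposals
        # (shared normalizing constants cancel in the ratio).
        log_q_fwd = -((prop - x - step * score(x)) ** 2) / (4.0 * step)
        log_q_rev = -((x - prop - step * score(prop)) ** 2) / (4.0 * step)
        # The accept/reject test needs energy values, not only scores.
        log_alpha = energy(x) - energy(prop) + log_q_rev - log_q_fwd
        accept = np.log(rng.uniform(size=x.shape)) < log_alpha
        x = np.where(accept, prop, x)
    return x

rng = np.random.default_rng(0)
model_a = make_gaussian_energy(mu=-1.0, sigma=1.0)
model_b = make_gaussian_energy(mu=2.0, sigma=0.5)
energy, score = compose_product([model_a, model_b])

samples = mala(energy, score, rng.standard_normal(4000), step=0.1, n_steps=500, rng=rng)
# The product of these two Gaussians is Gaussian with mean 1.4 and variance 0.2.
print(samples.mean(), samples.var())
```

The acceptance test is the reason the energy matters in this sketch: a model that exposes only a score can drive the Langevin proposal, but the Metropolis correction compares unnormalized log-densities at the current and proposed points.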