By formulating the generation of data samples as a Markov denoising process, diffusion models achieve state-of-the-art performance on a range of tasks. Recently, many variants of diffusion models have been proposed to enable controlled sample generation. Most existing methods either formulate the controlling information as an input (i.e., a conditional representation) to the noise approximator, or introduce a pre-trained classifier at test time to guide the Langevin dynamics towards the conditional goal. However, the former line of methods works only when the controlling information can be expressed as a conditional representation, while the latter requires the pre-trained guidance classifier to be differentiable. In this paper, we propose a novel framework named RGDM (Reward-Guided Diffusion Model) that guides the training phase of diffusion models via reinforcement learning (RL). The proposed training framework bridges the objectives of weighted log-likelihood and maximum entropy RL, which enables policy gradients to be computed from samples drawn from a pay-off distribution proportional to exponentially scaled rewards, rather than from the policies themselves. Such a framework alleviates high gradient variance and enables diffusion models to explore highly rewarded samples in the reverse process. Experiments on 3D shape and molecule generation tasks show significant improvements over existing conditional diffusion models.
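To make this bridge concrete, here is a minimal sketch in our own notation (an illustrative derivation, not the paper's). Let $r(x)$ be the reward of a generated sample, $p_\theta(x)$ the marginal distribution induced by the reverse process, and $\pi^*(x) \propto \exp(r(x))$ the pay-off distribution. Then

$$ \mathbb{E}_{x \sim p_\theta}\!\left[ r(x) \right] + \mathcal{H}\!\left( p_\theta \right) \;=\; -\,\mathrm{KL}\!\left( p_\theta \,\|\, \pi^* \right) + \mathrm{const}, $$

$$ \mathbb{E}_{x \sim \pi^*}\!\left[ \log p_\theta(x) \right] \;=\; -\,\mathrm{KL}\!\left( \pi^* \,\|\, p_\theta \right) + \mathrm{const}. $$

Both objectives are maximized by $p_\theta = \pi^*$, and the gradient of the second, $\mathbb{E}_{x \sim \pi^*}\!\left[ \nabla_\theta \log p_\theta(x) \right]$, is estimated with samples drawn from $\pi^*$, which does not depend on $\theta$. This is one reading of how policy gradients can be computed from the pay-off distribution rather than from the policy itself, thereby reducing gradient variance.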