In this work we show theoretically that conservative objective models (COMs) for offline model-based optimisation (MBO) are a special kind of contrastive-divergence-based energy model, one whose energy function represents both the unconditional probability of the input and the conditional probability of the reward variable. While the initial formulation only generates samples at the modes of its learned distribution, we propose a simple fix that replaces its gradient ascent sampler with a Langevin MCMC sampler. This gives rise to a special probabilistic model in which the probability of sampling an input is proportional to its predicted reward. Lastly, we show that better samples can be obtained if the model is decoupled so that the unconditional and conditional probabilities are modelled separately.
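The proposed fix can be sketched as follows: gradient ascent on a learned score deterministically climbs to a mode, whereas (unadjusted) Langevin dynamics adds Gaussian noise at each step so that, in the long run, samples are drawn approximately in proportion to the modelled density. This is a minimal illustrative sketch, not the paper's implementation; `grad_log_p` stands in for the gradient of the learned log-density (or negative energy), and all names and hyperparameters here are assumptions.

```python
import numpy as np

def gradient_ascent(grad_log_p, x0, step=1e-2, n_steps=500):
    """Deterministic ascent: converges to a local mode of log p(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * grad_log_p(x)
    return x

def langevin_sample(grad_log_p, x0, step=1e-2, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics:
    x <- x + (step/2) * grad log p(x) + sqrt(step) * N(0, I),
    whose stationary distribution approximates p(x)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * noise
    return x

# Toy density: standard normal, so grad log p(x) = -x.
score = lambda x: -x
mode = gradient_ascent(score, np.ones(1))          # collapses to the mode at 0
draws = np.array([langevin_sample(score, np.zeros(1))
                  for _ in range(200)])            # spread roughly like N(0, 1)
```

The contrast is the point of the abstract's fix: the ascent sampler returns (near-)identical mode points, while the Langevin sampler's draws retain variance consistent with the underlying distribution.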