Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences are good for approximating different targets. For instance, we discover that for GDC, the Jensen-Shannon divergence frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
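As a concrete point of reference, the three objectives named above can be written compactly as follows. This is a minimal sketch; the notation (a trained policy \(\pi_\theta\), a target distribution \(p\), and the generator-based form of the f-divergence) is an assumption made here for illustration, not notation fixed by this abstract.

\[
\underbrace{\min_\theta \; \mathrm{KL}\!\left(\pi_\theta \,\Vert\, p\right)}_{\text{RLHF (reverse KL)}}
\qquad
\underbrace{\min_\theta \; \mathrm{KL}\!\left(p \,\Vert\, \pi_\theta\right)}_{\text{GDC via DPG (forward KL)}}
\qquad
\underbrace{\min_\theta \; D_f\!\left(p \,\Vert\, \pi_\theta\right)}_{\text{f-DPG (any } f\text{-divergence)}}
\]

where \(D_f(p \,\Vert\, q) = \mathbb{E}_{x \sim q}\!\left[ f\!\left( p(x)/q(x) \right) \right]\) for a convex generator \(f\) with \(f(1)=0\); the reverse KL, forward KL, and Jensen-Shannon divergences are all recovered by particular choices of \(f\).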