Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
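To make the "data weight adjuster" role of the reference model concrete, below is a minimal sketch of the standard DPO objective together with comments outlining the two-stage Pre-DPO recipe as we read it from the abstract. The function name `dpo_loss`, the variable names, and the assumption that the guiding reference is a policy previously optimized on the same preference data are illustrative assumptions, not code from the paper.

```python
# Minimal sketch, assuming PyTorch; names are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on per-sequence log-probabilities.
    The frozen reference terms shift each pair's implicit reward margin,
    which is how the reference model effectively re-weights individual
    preference samples during training."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Pre-DPO sketched as two stages (our reading of the abstract, not
# verified pseudocode from the paper):
#   Stage 1: run DPO/SimPO as usual from the SFT policy to obtain an
#            optimized policy pi_star on the preference data.
#   Stage 2: re-train from the original SFT policy on the same data,
#            but supply pi_star's log-probs as ref_*_logps in dpo_loss,
#            so pi_star acts as the guiding reference that adaptively
#            re-weights the samples.

if __name__ == "__main__":
    # Toy check with random per-sequence log-probabilities for a batch of 4.
    b = 4
    print(dpo_loss(torch.randn(b), torch.randn(b),
                   torch.randn(b), torch.randn(b)).item())
```

In this sketch the guiding reference never needs to be an external model: it is just an earlier optimization result on the same preference data, which matches the abstract's claim that Pre-DPO requires no external models or additional data.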