Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal: the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence, leading to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrast and clearer preference signals. We show theoretically that conditioning on hints increases the expected preference margin, via a mutual-information argument, and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
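As a point of reference, a minimal sketch of the objective involved, using standard DPO notation with policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, scaling coefficient $\beta$, and preferred/rejected responses $(y_w, y_l)$ for prompt $x$; the hint-conditioned sampling shown afterward, with $h$ denoting a reflective hint, is illustrative of the construction described above rather than the exact RPO objective:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
\]

\[
y_w \sim \pi_\theta(\cdot\mid x, h), \qquad y_l \sim \pi_\theta(\cdot\mid x),
\]

that is, under this sketch the chosen response is re-sampled from the current policy after conditioning on the reflective hint, while the rejected response is an unconditioned on-policy generation; the resulting pair is then trained with the DPO loss above.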