In this paper, we show that Simple Preference Optimization (SimPO) can be derived as a form of Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments show that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods, which maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online ones. Our findings suggest that reference-free approaches may face distinct challenges depending on whether they are applied to online or offline preference learning.
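As a brief sketch of the connection referenced above (the notation and the exact form of SimPO's implicit reward are our assumptions, following the original SimPO formulation, not details drawn from this abstract): the Maximum Entropy RL objective is
\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;+\; \beta\, \mathcal{H}\big(\pi_\theta(\cdot \mid x)\big),
\]
whose optimal policy satisfies \(\pi^*(y \mid x) \propto \exp\big(r(x, y)/\beta\big)\), equivalently \(r(x, y) = \beta \log \pi^*(y \mid x) + \beta \log Z(x)\). SimPO's implicit reward, \(r_{\mathrm{SimPO}}(x, y) = \tfrac{\beta}{|y|} \log \pi_\theta(y \mid x)\), is the length-normalized log-likelihood of the policy itself, with no reference model; up to the partition term and the length normalization, this matches the reward implied by a maximum-entropy optimal policy, which is the correspondence the paper develops.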