Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF remains limited, as learning the KL-regularized target from preference feedback alone poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model and extend classical designs built on optimism or pessimism. This work instead considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), rather than the optimistic or pessimistic estimates constructed in previous works. This insight is deeply rooted in a unique structural property of the optimal policy class under the KL-regularized target; we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
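For context, the following is a minimal sketch of the standard formulations the abstract refers to; the notation (reference policy \pi_{\mathrm{ref}}, reward r, regularization weight \beta) is the conventional one assumed here, not taken from the paper.

```latex
% KL-regularized RLHF objective (standard form; notation assumed):
% maximize expected reward while staying close to a reference policy \pi_{\mathrm{ref}}.
\max_{\pi}\;
  \mathbb{E}_{x\sim\rho,\, y\sim\pi(\cdot\mid x)}\!\left[ r(x,y) \right]
  \;-\; \beta\, \mathbb{E}_{x\sim\rho}\!\left[
      \mathrm{KL}\!\left( \pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x) \right)
  \right].

% Its maximizer admits a closed form, which is the structural property of the
% optimal policy class alluded to in the abstract:
\pi^{\star}(y\mid x) \;\propto\; \pi_{\mathrm{ref}}(y\mid x)\,
  \exp\!\left( r(x,y)/\beta \right).

% Bradley--Terry (BT) preference model: pairwise preferences are generated
% from an underlying reward through the logistic (sigmoid) link \sigma.
\mathbb{P}\!\left( y_1 \succ y_2 \mid x \right)
  \;=\; \sigma\!\left( r(x,y_1) - r(x,y_2) \right)
  \;=\; \frac{\exp\!\left( r(x,y_1) \right)}
             {\exp\!\left( r(x,y_1) \right) + \exp\!\left( r(x,y_2) \right)}.
```

In contrast, the general preference model considered in this work takes the pairwise preference probability \mathbb{P}(y_1 \succ y_2 \mid x) as the primitive object and does not require it to be induced by any underlying reward function.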