We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, when the learned reward model is used to train a policy, the MLE can fail, whereas a pessimistic MLE yields policies with improved performance under certain coverage assumptions. Additionally, we show that under the PL model, both the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons converge, and that the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.
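To make the reward-estimation step concrete, the following is a minimal sketch of the pairwise (BTL) MLE with a linear reward $r_\theta(x) = \theta^\top \phi(x)$, fitted by minimizing the negative log-likelihood of observed comparisons. The feature dimension, synthetic comparison data, and variable names (`theta_star`, `diff`, etc.) are illustrative assumptions, not the paper's construction or experiments.

```python
# Minimal sketch (assumed setup, not the paper's implementation): MLE for a
# linear reward model under the Bradley-Terry-Luce (BTL) pairwise model.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 5, 500                        # feature dimension, number of comparisons

theta_star = rng.normal(size=d)      # "true" linear reward parameter (assumed)
x_pref = rng.normal(size=(n, d))     # features of the first item in each pair
x_alt = rng.normal(size=(n, d))      # features of the second item

# BTL model: P(first item preferred) = sigmoid(theta^T (phi_first - phi_second))
diff = x_pref - x_alt
p_win = 1.0 / (1.0 + np.exp(-diff @ theta_star))
y = (rng.uniform(size=n) < p_win).astype(float)   # 1 if first item preferred

def neg_log_likelihood(theta):
    """Negative log-likelihood of the observed preferences under the BTL model."""
    logits = diff @ theta
    # -log sigmoid(logit) = logaddexp(0, -logit); -log(1 - sigmoid) = logaddexp(0, logit)
    return np.sum(y * np.logaddexp(0.0, -logits) + (1 - y) * np.logaddexp(0.0, logits))

theta_mle = minimize(neg_log_likelihood, np.zeros(d), method="L-BFGS-B").x
print("parameter estimation error:", np.linalg.norm(theta_mle - theta_star))
```

Under the paper's perspective, the point of the subsequent pessimistic step is that downstream policy optimization should use a conservative (lower-confidence) estimate of $\theta^\top \phi$ rather than the raw MLE; the sketch above only illustrates the plain MLE fit.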