Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from the preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both the sequence-level RM and token-level RM settings, thereby demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.
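To make the abstract's loop concrete, the following is a minimal toy sketch of the kind of min-max preference-based optimization it describes, reduced to a K-armed "response" setting: an RM kept inside a likelihood-based confidence set built from Bradley-Terry preferences, a softmax policy maximizing the pessimistic (worst-case) reward in that set with a KL penalty to a reference policy, and online preference collection guided by the current policy. Everything here, including the bandit setup, the Bradley-Terry oracle, and the penalty-based approximation of the confidence-set constraint, is an illustrative assumption, not the paper's actual implementation.

```python
# Toy sketch of preference-based policy optimization as a min-max game
# (illustrative only; not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)
K = 6                                  # number of candidate "responses" (arms)
r_true = rng.normal(size=K)            # latent ground-truth rewards
tau, beta, lam = 0.1, 2.0, 10.0        # KL weight, confidence radius, penalty weight

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def nll_and_grad(r, data):
    """Bradley-Terry negative log-likelihood and gradient over pairs (i, j),
    where arm i was preferred to arm j."""
    nll, grad = 0.0, np.zeros_like(r)
    for i, j in data:
        p = sigmoid(r[i] - r[j])
        nll -= np.log(p + 1e-12)
        grad[i] -= (1.0 - p)
        grad[j] += (1.0 - p)
    return nll, grad

def fit_mle(data, steps=300, lr=0.1):
    r = np.zeros(K)
    for _ in range(steps):
        _, g = nll_and_grad(r, data)
        r -= lr * (g + 1e-3 * r)       # small L2 breaks the shift invariance of BT
    return r

theta = np.zeros(K)                    # policy logits
pi_ref = np.ones(K) / K                # uniform reference policy
data = []                              # preference dataset

for rnd in range(30):
    # --- online preference collection guided by the current policy ---
    pi = softmax(theta)
    for _ in range(8):
        i, j = rng.choice(K, size=2, replace=False, p=pi)
        winner, loser = (i, j) if rng.random() < sigmoid(r_true[i] - r_true[j]) else (j, i)
        data.append((winner, loser))

    # --- RM step: pessimistic reward inside the confidence set (penalty approx.) ---
    r_mle = fit_mle(data)
    thr = nll_and_grad(r_mle, data)[0] + beta
    r = r_mle.copy()
    for _ in range(100):
        pi = softmax(theta)
        nll, g_nll = nll_and_grad(r, data)
        grad = pi + (lam * g_nll if nll > thr else 0.0)   # min E_pi[r] while r stays in set
        r -= 0.05 * grad

    # --- policy step: maximize pessimistic value with KL regularization ---
    for _ in range(50):
        pi = softmax(theta)
        v = pi @ r
        kl = np.sum(pi * np.log(pi / pi_ref))
        grad = pi * (r - v) - tau * pi * (np.log(pi / pi_ref) - kl)
        theta += 0.2 * grad

print("true best arm:", int(np.argmax(r_true)))
print("learned policy:", np.round(softmax(theta), 3))
```

In this sketch the inner minimization over the RM is handled with a simple penalty whenever the Bradley-Terry negative log-likelihood exceeds the MLE value plus a radius beta, which is only one crude way to approximate a confidence-set constraint; the paper's sequence-level and token-level RM formulations and guarantees are not captured by this toy example.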