Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences without requiring a hand-crafted reward function. However, existing approaches either assume access to a high-fidelity simulator or an analytic model, or take a model-free approach that requires extensive, potentially unsafe online environment interaction. In this paper, we study the benefits and challenges of using a learned dynamics model when performing PbRL. In particular, we provide evidence that a learned dynamics model offers the following benefits when performing PbRL: (1) preference elicitation and policy optimization require significantly fewer environment interactions than model-free PbRL, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pre-training based on suboptimal demonstrations can be performed without any environment interaction. Our paper provides empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample-efficient than prior preference-learning approaches.
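To make the model-based PbRL idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation): candidate trajectory segments are rolled out entirely inside a learned dynamics model, a pair of segments forms a preference query, and the reward network is updated with the standard Bradley-Terry preference likelihood. All network architectures, dimensions, and function names here are assumptions for illustration only.

```python
# Illustrative sketch of one model-based PbRL update step.
# Assumes a learned dynamics model f(s, a) -> s' and a learned reward r(s, a);
# no real environment steps are taken when generating the query segments.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HORIZON = 4, 2, 10

dynamics = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                         nn.Linear(64, STATE_DIM))   # learned dynamics model
reward = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))             # learned reward function
opt = torch.optim.Adam(reward.parameters(), lr=1e-3)

def rollout(s0, actions):
    """Simulate a trajectory segment in the learned model."""
    states, s = [], s0
    for a in actions:
        s = dynamics(torch.cat([s, a]))
        states.append(s)
    return torch.stack(states)

def segment_return(states, actions):
    """Sum of predicted rewards along a model-generated segment."""
    sa = torch.cat([states, torch.stack(actions)], dim=-1)
    return reward(sa).sum()

# Two candidate action sequences form one preference query; in a full method
# these would come from the planner's own candidate rollouts.
s0 = torch.zeros(STATE_DIM)
seq_a = [torch.randn(ACTION_DIM) for _ in range(HORIZON)]
seq_b = [torch.randn(ACTION_DIM) for _ in range(HORIZON)]
ret_a = segment_return(rollout(s0, seq_a), seq_a)
ret_b = segment_return(rollout(s0, seq_b), seq_b)

# Suppose the user prefers segment A; minimize the Bradley-Terry
# negative log-likelihood of that label with respect to the reward network.
pref_loss = -(ret_a - torch.logsumexp(torch.stack([ret_a, ret_b]), dim=0))
opt.zero_grad()
pref_loss.backward()
opt.step()
```

Because both query segments are generated inside the learned model, they can be screened or diversified before ever being executed on the robot, which is the source of the safety and sample-efficiency benefits claimed above.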