While reinforcement learning (RL) has become an increasingly popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven extremely difficult due to their inability to capture human intent and their vulnerability to policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an implausibly large number of queries for any human to answer, or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. In contrast to most prior work, which focuses on query selection to \emph{minimize} the amount of data required for learning reward functions, we take the opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda robot. Moreover, this reduction in query complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
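For concreteness, the following is a minimal sketch of the preference-learning setup alluded to above, using the standard Bradley--Terry model over trajectory segments common in preference-based RL; the reward parameterization $r_\psi$, the step size $\alpha$, and the single-gradient-step adaptation are illustrative assumptions rather than a specification of our method.
\[
P_\psi\!\left[\sigma^1 \succ \sigma^0\right]
  = \frac{\exp\big(\textstyle\sum_t r_\psi(s^1_t, a^1_t)\big)}
         {\exp\big(\textstyle\sum_t r_\psi(s^0_t, a^0_t)\big) + \exp\big(\textstyle\sum_t r_\psi(s^1_t, a^1_t)\big)},
\]
\[
\mathcal{L}(\psi; \mathcal{D}) = -\!\!\sum_{(\sigma^0,\sigma^1,y) \in \mathcal{D}}\!\! \Big[\, y \log P_\psi\big[\sigma^1 \succ \sigma^0\big] + (1-y) \log P_\psi\big[\sigma^0 \succ \sigma^1\big] \Big],
\qquad
\psi' = \psi - \alpha \nabla_\psi \mathcal{L}\big(\psi; \mathcal{D}^{\text{new}}\big),
\]
where $y \in \{0,1\}$ is the human's preference label, $\mathcal{D}^{\text{new}}$ is the handful of queries collected for the new task, and $\psi$ is initialized from pre-training on prior task data.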