Learning a reward function from human preferences is challenging because it typically requires either a high-fidelity simulator or expensive and potentially unsafe physical rollouts in the environment. However, in many tasks the agent might have access to offline data from related tasks in the same target environment. While offline data is increasingly being used to aid policy optimization via offline RL, our observation is that it can be a surprisingly rich source of information for preference learning as well. We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning, learns a distribution over reward functions, and optimizes a corresponding policy via offline RL. Crucially, our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps. To test our approach, we first evaluate existing offline RL benchmarks for their suitability for offline reward learning. Surprisingly, for many offline RL domains, we find that simply using a trivial reward function results in good policy performance, making these domains ill-suited for evaluating learned rewards. To address this, we identify a subset of existing offline RL benchmarks that are well suited for offline reward learning and also propose new offline apprenticeship learning benchmarks which allow for more open-ended behaviors. When evaluated on this curated set of domains, our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data.
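To make the pipeline described above concrete, the following is a minimal, illustrative sketch of two of its ingredients: fitting a distribution over reward functions from preference labels via an ensemble trained with the Bradley-Terry preference model, and scoring candidate segment pairs from the offline pool by ensemble disagreement for active query selection. All names (RewardNet, bradley_terry_loss, ensemble_disagreement) are hypothetical and assume a PyTorch-style implementation; the paper's actual posterior representation and acquisition criterion may differ.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP mapping a state-action feature vector to a scalar reward.

    Hypothetical architecture; any reward regressor could be substituted.
    """
    def __init__(self, obs_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment):
        # segment: (T, obs_act_dim) -> predicted return of the segment
        return self.net(segment).sum()

def bradley_terry_loss(reward_net, seg_a, seg_b, label):
    """Cross-entropy on P(seg_a preferred over seg_b) under Bradley-Terry,
    where preference probability is a softmax over predicted segment returns.
    label: 0-dim long tensor, 0 if seg_a preferred, 1 if seg_b preferred."""
    returns = torch.stack([reward_net(seg_a), reward_net(seg_b)])
    return nn.functional.cross_entropy(returns.unsqueeze(0), label.unsqueeze(0))

def ensemble_disagreement(ensemble, seg_a, seg_b):
    """Pool-based active-learning score: variance across ensemble members of
    the predicted probability that seg_a is preferred. Higher variance means
    the query is more informative to ask a human about."""
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(torch.stack([net(seg_a), net(seg_b)]), dim=0)[0]
            for net in ensemble
        ])
    return probs.var()
```

In this sketch, each ensemble member would be trained on the accumulated preference labels with `bradley_terry_loss`, the highest-disagreement pair of segments drawn from the offline dataset would be shown to the human next, and the mean (or a pessimistic quantile) of the ensemble's reward predictions would then relabel the offline data for a downstream offline RL algorithm.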