For many reinforcement learning (RL) applications, specifying a reward is difficult. In this paper, we consider an RL setting where the agent can obtain information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible. For this purpose, we propose Information Directed Reward Learning (IDRL), which uses a Bayesian model of the reward function and selects queries that maximize the information gain about the difference in return between potentially optimal policies. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. Moreover, by shifting the focus from reducing the reward approximation error to improving the policy induced by the reward model, it achieves similar or better performance with significantly fewer queries. We support our findings with extensive evaluations in multiple environments and with different types of queries.
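To make the query-selection idea concrete, below is a minimal sketch of an IDRL-style acquisition step. It is an illustrative setup of my own, not the paper's implementation: it assumes a reward that is linear in known features with a Gaussian prior over the weights, Gaussian observation noise on the expert's answers, single-state reward queries, and exactly two candidate (plausibly optimal) policies with known feature expectations. All variable names (phi_pi1, phi_pi2, candidate_phis, etc.) are hypothetical.

```python
import numpy as np

# Illustrative sketch of selecting the query that maximizes information gain
# about the return difference between two candidate policies (assumptions as
# stated in the lead-in; not the paper's actual implementation).

rng = np.random.default_rng(0)
d = 5                                  # feature dimension (assumed)
Sigma = np.eye(d)                      # prior covariance over reward weights w
sigma_obs = 0.1                        # std of the expert's noisy reward answers

# Feature expectations of two plausibly optimal candidate policies (assumed).
phi_pi1 = rng.normal(size=d)
phi_pi2 = rng.normal(size=d)
delta = phi_pi1 - phi_pi2              # return difference G = delta^T w

# Candidate queries: states whose reward the expert could be asked to evaluate.
candidate_phis = rng.normal(size=(20, d))

def posterior_var_of_G(Sigma, phi_q, delta, sigma_obs):
    """Variance of G = delta^T w after observing a noisy reward at phi_q."""
    # Standard Gaussian conditioning for a linear observation y = phi_q^T w + noise.
    s = phi_q @ Sigma @ phi_q + sigma_obs ** 2
    Sigma_post = Sigma - np.outer(Sigma @ phi_q, phi_q @ Sigma) / s
    return delta @ Sigma_post @ delta

prior_var = delta @ Sigma @ delta
# Information gain about G from a query: 0.5 * log(prior var / posterior var).
gains = np.array([
    0.5 * np.log(prior_var / posterior_var_of_G(Sigma, phi_q, delta, sigma_obs))
    for phi_q in candidate_phis
])
best_query = int(np.argmax(gains))
print(f"query {best_query} maximizes info gain about the return difference")
```

In this simplified linear-Gaussian setting the information gain has a closed form, which is what makes the sketch short; the general method described above instead uses a Bayesian reward model appropriate to the query type (e.g., state evaluations or trajectory preferences) and estimates the gain for the set of potentially optimal policies.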