For many reinforcement learning (RL) applications, specifying a reward is difficult. This paper considers an RL setting where the agent obtains information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward that allows standard RL algorithms to achieve high expected returns with as few expert queries as possible. To this end, we propose Information Directed Reward Learning (IDRL), which uses a Bayesian model of the reward and selects queries that maximize the information gain about the difference in return between plausibly optimal policies. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. Moreover, it achieves similar or better performance with significantly fewer queries by shifting the focus from reducing the reward approximation error to improving the policy induced by the reward model. We support our findings with extensive evaluations in multiple environments and with different query types.
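As a rough illustration of the selection principle described above (not the paper's implementation), the following toy sketch uses independent Gaussian beliefs over per-state rewards and picks the query that most reduces uncertainty about the return *difference* between the two policies with the highest posterior mean return. All names, the state-visitation matrix `F`, and the noise model are illustrative assumptions.

```python
import numpy as np

# Toy sketch of IDRL-style query selection (illustrative only).
# Belief over per-state rewards: independent Gaussians N(mu[s], var[s]).
# Policy i visits states with frequencies F[i]; its return is F[i] @ reward.

def query_value(F, var, noise_var, i, j, s):
    """Variance of (return_i - return_j) removed by observing reward[s] once."""
    d = F[i] - F[j]                      # weight of state s in the return difference
    # Posterior variance of reward[s] after one noisy observation:
    post = 1.0 / (1.0 / var[s] + 1.0 / noise_var)
    return d[s] ** 2 * (var[s] - post)   # uncertainty removed from the difference

def select_query(F, mu, var, noise_var=0.1):
    returns = F @ mu
    # "Plausibly optimal" here is simplified to the top two policies by mean return.
    i, j = np.argsort(returns)[-2:]
    gains = [query_value(F, var, noise_var, i, j, s) for s in range(len(mu))]
    return int(np.argmax(gains))

# Example: 3 policies over 4 states.
F = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.0, 0.3, 0.7, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
mu = np.array([1.0, 0.5, 0.9, 0.2])
var = np.array([0.5, 0.5, 0.5, 0.5])
print(select_query(F, mu, var))
```

Note how the rule ignores state 1 (visited equally by the two leading policies) and state 3 (only relevant to the clearly suboptimal policy), and instead queries a state that actually discriminates between the plausibly optimal policies, which is the paper's motivation for targeting return differences rather than overall reward approximation error.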