We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to include hidden state information that captures temporal dependencies in human assessment of trajectories. We then show how RM can be approached as a multiple instance learning (MIL) problem, and develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and that they provide interpretable learnt hidden information that can be used to train high-performing agent policies.
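As an illustration only (not code from the paper), the sketch below shows one way reward modelling could be cast as multiple instance learning with a hidden state: each trajectory is treated as a bag of state-action instances, only a trajectory-level label is observed, and a recurrent network carries temporal context between steps. PyTorch, the LSTM architecture, the sum aggregation over per-timestep rewards, and the MSE loss on trajectory-level return labels are all assumptions made for this sketch, not details taken from the abstract.

```python
# Minimal sketch of a recurrent MIL reward model (illustrative assumptions only).
import torch
import torch.nn as nn

class RecurrentMILRewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Hidden state summarises the trajectory so far (non-Markovian context).
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden_dim, batch_first=True)
        # Per-timestep reward head applied to each hidden state.
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, states: torch.Tensor, actions: torch.Tensor):
        # states: (batch, T, obs_dim), actions: (batch, T, act_dim)
        x = torch.cat([states, actions], dim=-1)
        hidden, _ = self.lstm(x)                              # (batch, T, hidden_dim)
        step_rewards = self.reward_head(hidden).squeeze(-1)   # (batch, T)
        # MIL aggregation: the bag (trajectory) label is modelled as the sum
        # of the per-instance (per-timestep) reward predictions.
        return step_rewards, step_rewards.sum(dim=1)

def training_step(model, optimiser, states, actions, trajectory_labels):
    """Regress the predicted trajectory return onto the human-provided label."""
    _, predicted_return = model(states, actions)
    loss = nn.functional.mse_loss(predicted_return, trajectory_labels)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```

The per-timestep outputs of such a model are what would make the learnt hidden information inspectable: they can be read off alongside the hidden state to see which parts of a trajectory the model credits with reward.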