We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.
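Below is a minimal, illustrative sketch of the query-synthesis loop described above, written in PyTorch. It is not the paper's implementation: the network sizes, the residual dynamics model, the ensemble-disagreement acquisition objective (standing in for the tractable value-of-information proxies mentioned in the abstract), and the simulated user are all assumptions made for illustration.

```python
# Hedged sketch of a ReQueST-style query-synthesis loop.
# All model architectures, dimensions, and the ensemble-disagreement proxy
# are illustrative assumptions, not the paper's exact objectives.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HORIZON, ENSEMBLE = 4, 2, 10, 3

class Dynamics(nn.Module):
    """Learned forward model s_{t+1} = f(s_t, a_t), trained on off-policy data."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                                 nn.Tanh(), nn.Linear(64, STATE_DIM))
    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))  # residual next-state prediction

def make_reward_ensemble():
    """Small ensemble of reward predictors; their disagreement is the info proxy."""
    return [nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
            for _ in range(ENSEMBLE)]

def synthesize_query(dynamics, ensemble, s0, steps=200, lr=0.05):
    """Optimize an open-loop action sequence (no environment interaction) so that
    the rollout under the learned dynamics maximizes reward-model disagreement,
    one tractable stand-in for the value of information."""
    actions = torch.zeros(HORIZON, ACTION_DIM, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        s, disagreement = s0, 0.0
        for t in range(HORIZON):
            s = dynamics(s, torch.tanh(actions[t]))           # bounded actions
            preds = torch.stack([r(s) for r in ensemble])
            disagreement = disagreement + preds.var(dim=0).sum()
        opt.zero_grad()
        (-disagreement).backward()                            # gradient ascent on the proxy
        opt.step()
    return torch.tanh(actions).detach()

def label_and_update(dynamics, ensemble, s0, actions, user_reward_fn, lr=1e-3):
    """Roll out the synthesized behavior, ask the (simulated) user for reward
    labels, and fit every ensemble member to those labels."""
    opts = [torch.optim.Adam(r.parameters(), lr=lr) for r in ensemble]
    s, data = s0, []
    with torch.no_grad():
        for t in range(HORIZON):
            s = dynamics(s, actions[t])
            data.append((s, torch.tensor([user_reward_fn(s)])))
    for r, opt in zip(ensemble, opts):
        loss = sum((r(s) - y).pow(2).mean() for s, y in data)
        opt.zero_grad(); loss.backward(); opt.step()

if __name__ == "__main__":
    torch.manual_seed(0)
    dynamics, ensemble = Dynamics(), make_reward_ensemble()
    s0 = torch.zeros(STATE_DIM)                  # in practice, sampled from a generative model
    simulated_user = lambda s: float(-s.norm())  # stand-in for human reward labels
    for _ in range(5):                           # a few query rounds
        actions = synthesize_query(dynamics, ensemble, s0)
        label_and_update(dynamics, ensemble, s0, actions, simulated_user)
```

The key design point the sketch tries to convey is that the query trajectories are produced purely by optimizing through the learned generative and dynamics models, so no environment interaction (and hence no visit to unsafe states) is needed to gather informative labels.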