Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring that feedback. Recent work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from that state. Such learning is possible in principle, but requires simulating all possible past trajectories that could have led to the observed state. This is feasible in gridworlds, but how do we scale it to complex tasks? In this work, we show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done. The resulting algorithm can reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill.
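To make the backward-simulation idea concrete, here is a minimal PyTorch sketch of how a learned feature encoder might be combined with learned inverse models to roll a trajectory backwards from a single observed state. All names (`FeatureEncoder`, `InversePolicy`, `InverseDynamics`, `simulate_past`), the network shapes, and the deterministic point estimates are illustrative assumptions, not the paper's actual architecture, which would consider sampled candidate past trajectories rather than a single deterministic rollout.

```python
import torch
import torch.nn as nn


class FeatureEncoder(nn.Module):
    """Maps raw states to a learned feature representation (assumed MLP)."""
    def __init__(self, state_dim, feature_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, state):
        return self.net(state)


class InversePolicy(nn.Module):
    """Predicts the action that plausibly led to the current (encoded) state."""
    def __init__(self, feature_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, features):
        return self.net(features)


class InverseDynamics(nn.Module):
    """Predicts the previous state given the current encoded state and an action."""
    def __init__(self, feature_dim, action_dim, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, features, action):
        return self.net(torch.cat([features, action], dim=-1))


def simulate_past(state, encoder, inverse_policy, inverse_dynamics, horizon=10):
    """Roll a trajectory backwards in time from an observed state.

    Returns a list of (previous_state, action) pairs, most recent first.
    This is a hypothetical helper for illustration only.
    """
    trajectory = []
    current = state
    for _ in range(horizon):
        features = encoder(current)
        action = inverse_policy(features)               # action that likely produced `current`
        previous = inverse_dynamics(features, action)   # state that likely preceded it
        trajectory.append((previous, action))
        current = previous
    return trajectory
```

In this sketch, the inverse policy guesses the action that most plausibly produced the current state and the inverse dynamics model steps one state further into the past; iterating the two yields a simulated past trajectory from which information about human preferences could then be inferred.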