We consider the problem of learning from observation (LfO), in which the agent aims to mimic the expert's behavior from state-only expert demonstrations. We additionally assume that the agent cannot interact with the environment but has access to action-labeled transition data collected by agents of unknown quality. This offline setting for LfO is appealing in many real-world scenarios where ground-truth expert actions are inaccessible and arbitrary environment interaction is costly or risky. In this paper, we present LobsDICE, an offline LfO algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions. Our algorithm solves a single convex minimization problem that minimizes the divergence between the state-transition distributions induced by the expert and the agent policy. Through an extensive set of offline LfO tasks, we show that LobsDICE outperforms strong baseline methods.
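At a high level, the divergence-minimization view described above can be sketched as follows (a minimal illustration only; the specific divergence and the offline regularization actually used by LobsDICE may differ):
\[
\min_{\pi} \; D\!\left( d^{\pi}(s, s') \,\|\, d^{E}(s, s') \right),
\]
where $d^{\pi}(s, s')$ and $d^{E}(s, s')$ denote the stationary state-transition distributions induced by the agent policy $\pi$ and the expert policy, respectively, and $D$ is a divergence between distributions (e.g., the KL divergence).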