Offline Reinforcement Learning methods seek to learn a policy from logged transitions of an environment, without any further interaction. In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to constrain the policy to visit state-action pairs close to the support of the logged transitions. In this work, we propose an iterative procedure to learn a pseudometric (closely related to bisimulation metrics) from logged transitions, and use it to define this notion of closeness. We show its convergence and extend it to the function approximation setting. We then use this pseudometric to define a new lookup-based bonus in an actor-critic algorithm: PLOFF. This bonus encourages the actor to stay close, in terms of the defined pseudometric, to the support of the logged transitions. Finally, we evaluate the method on hand manipulation and locomotion tasks.
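To make the lookup-based bonus concrete, here is a minimal sketch under stated assumptions: the learned pseudometric is stubbed by a Euclidean distance over concatenated state-action vectors (the paper learns a bisimulation-like pseudometric instead), and the names `pseudometric`, `lookup_bonus`, and `scale` are illustrative, not from the paper. The bonus is the negative pseudometric distance from a candidate state-action pair to its nearest neighbor in the logged dataset, so it is maximal on the support of the data and penalizes pairs far from it.

```python
import numpy as np

def pseudometric(sa, sa_prime):
    # Stand-in for the learned pseudometric d((s, a), (s', a')) >= 0;
    # here a simple Euclidean distance over concatenated state-action vectors.
    return np.linalg.norm(sa - sa_prime)

def lookup_bonus(state, action, logged_sa, scale=1.0):
    """Negative distance to the closest logged state-action pair.

    The bonus is zero on the support of the logged transitions and decreases
    as the candidate pair moves away from it, so adding it to the actor/critic
    objective discourages out-of-support actions.
    """
    sa = np.concatenate([state, action])
    distances = [pseudometric(sa, logged) for logged in logged_sa]
    return -scale * min(distances)

# Usage: a toy dataset of logged (state, action) pairs flattened into vectors.
rng = np.random.default_rng(0)
logged_sa = [np.concatenate([rng.normal(size=3), rng.normal(size=2)]) for _ in range(100)]
bonus = lookup_bonus(np.zeros(3), np.zeros(2), logged_sa)
```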