It is well known that Reinforcement Learning (RL) can be formulated as a convex program with linear constraints. The dual of this formulation is unconstrained; we refer to it as dual RL, and it lets us leverage existing tools from convex optimization to improve the learning performance of RL agents. We show that several state-of-the-art deep RL algorithms (in online, offline, and imitation settings) can be viewed as dual RL approaches within a unified framework. This unification calls for these methods to be studied on common ground, so as to identify the components that actually contribute to their success. Our unification also reveals that prior off-policy imitation learning methods in the dual space rely on an unrealistic coverage assumption and are restricted to matching a particular f-divergence. We propose a new method, based on a simple modification to the dual framework, that enables imitation learning from arbitrary off-policy data and attains near-expert performance.
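To make the primal-dual structure referenced above concrete, the following is a minimal sketch of the standard occupancy-measure formulation and its Lagrangian dual (as used in the DICE/dual-RL literature); the notation, the reference distribution d^O, and the regularization weight \alpha are illustrative assumptions rather than the paper's exact objective.

% Primal: a convex program over state-action visitations d(s,a) with
% linear Bellman-flow constraints, regularized by an f-divergence toward
% a reference distribution d^O (assumed here for illustration).
\begin{align*}
\max_{d \ge 0}\;\; & \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big] \;-\; \alpha\, D_f\!\left(d \,\|\, d^O\right) \\
\text{s.t.}\;\; & \textstyle\sum_a d(s,a) \;=\; (1-\gamma)\, d_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \qquad \forall s.
\end{align*}

% Dual: introducing Lagrange multipliers V(s) for the flow constraints and
% applying the convex conjugate f^* yields an unconstrained problem in V.
\begin{align*}
\min_{V}\;\; (1-\gamma)\, \mathbb{E}_{s \sim d_0}\big[V(s)\big]
\;+\; \alpha\, \mathbb{E}_{(s,a)\sim d^O}\!\left[ f^*\!\left(
  \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')] - V(s)}{\alpha}
\right) \right].
\end{align*}

Because the flow constraints are linear in d, dualizing them removes all constraints, which is what allows off-the-shelf convex-optimization machinery to be applied in the dual space.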