Although well established in general reinforcement learning (RL), value-based methods are rarely explored in constrained RL (CRL) because of their inability to find policies that randomize among multiple actions. To apply value-based methods to CRL, a recent groundbreaking line of game-theoretic approaches uses a mixed policy that randomizes among a set of carefully generated policies in order to converge to the desired constraint-satisfying policy. However, these approaches must store a large set of policies, which is policy-inefficient and may incur prohibitive memory costs in constrained deep RL. To address this problem, we propose an alternative approach. Our approach first reformulates CRL as an equivalent distance-optimization problem. With a specially designed linear optimization oracle, we derive a meta-algorithm that solves this problem using any off-the-shelf RL algorithm and any conditional gradient (CG)-type algorithm as subroutines. We then propose a new variant of the CG-type algorithm that generalizes the minimum norm point (MNP) method. The proposed method matches the convergence rate of the existing game-theoretic approaches while achieving worst-case optimal policy efficiency. Experiments on a navigation task show that our method reduces memory costs by an order of magnitude while achieving better performance, demonstrating both its effectiveness and efficiency.
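To make the distance-optimization view concrete, the following is a minimal, self-contained sketch of the conditional-gradient idea: find mixture weights over a small set of base-policy value vectors whose convex combination is as close as possible to a target point. The `argmin` over vertices stands in for the linear optimization oracle (in the paper's setting, a best-response policy produced by an RL subroutine); all names, the toy data, and the exact-line-search step are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cg_mixture_weights(vertices, target, iters=500):
    """Conditional-gradient (Frank-Wolfe) sketch: minimize the distance
    0.5 * ||x - target||^2 over the convex hull of `vertices`, where each
    row of `vertices` stands in for one base policy's value/cost vector.
    The argmin below plays the role of the linear optimization oracle;
    names and setup are illustrative, not the paper's API."""
    vertices = np.asarray(vertices, dtype=float)
    weights = np.zeros(len(vertices))
    weights[0] = 1.0                        # start at the first vertex
    x = vertices[0].copy()
    for _ in range(iters):
        grad = x - target                   # gradient of the squared distance
        i = int(np.argmin(vertices @ grad)) # linear optimization oracle
        d = vertices[i] - x
        denom = float(d @ d)
        if denom < 1e-16:                   # oracle returned the current point
            break
        # Exact line search for the quadratic objective, clipped to [0, 1]
        gamma = float(np.clip((target - x) @ d / denom, 0.0, 1.0))
        x = x + gamma * d
        weights *= (1.0 - gamma)            # mixture weights track the iterate
        weights[i] += gamma
    return weights, x

# Toy example: three "policies" with 2-D (reward, cost) value vectors; the
# target lies inside their convex hull, so the distance shrinks toward zero.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w, x = cg_mixture_weights(V, target=np.array([0.3, 0.3]))
```

Because each iteration adds weight to at most one vertex, the number of base policies that must be stored is bounded by the number of distinct oracle answers, which is the memory consideration the abstract highlights.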