Constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds. For this class of problems, we show a simple example in which the desired optimal policy cannot be induced by any linear combination of the rewards. Hence, there exist constrained reinforcement learning problems for which neither regularized nor classical primal-dual methods yield optimal policies. This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods as the portion of the dynamics that drives the evolution of the multipliers. This approach provides a systematic state-augmentation procedure that is guaranteed to solve reinforcement learning problems with constraints. Thus, while primal-dual methods can fail to find optimal policies, running the dual dynamics while executing the augmented policy yields an algorithm that provably samples actions from the optimal policy.
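The following is a minimal sketch of the idea described above: a policy is conditioned on the current Lagrange multipliers (the augmented state), and the dual dynamics update those multipliers based on how far the accumulated constraint rewards fall short of their thresholds. All names, dynamics, and step sizes (env step, constraint_reward, augmented_policy, eta, thresholds) are illustrative placeholders, not the paper's notation or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, n_constraints = 4, 2, 1
thresholds = np.array([0.5])   # required average accumulation of each constraint reward
eta = 0.05                     # dual step size
T = 200                        # number of dual iterations
H = 50                         # rollout length per dual iteration

def constraint_reward(s, a):
    # Toy constraint reward: action 1 "satisfies" the constraint in state 0.
    return np.array([1.0 if (s == 0 and a == 1) else 0.0])

def augmented_policy(s, lam):
    # Toy policy conditioned on (state, multipliers): a larger multiplier
    # pushes the policy toward the constraint-satisfying action.
    p1 = 1.0 / (1.0 + np.exp(-(lam[0] - 0.5)))
    return rng.choice(n_actions, p=[1.0 - p1, p1])

def step(s, a):
    # Toy dynamics: uniformly random next state.
    return rng.integers(n_states)

lam = np.zeros(n_constraints)
for t in range(T):
    # Execute the augmented policy pi(a | s, lam) for one rollout.
    s, acc = 0, np.zeros(n_constraints)
    for h in range(H):
        a = augmented_policy(s, lam)
        acc += constraint_reward(s, a)
        s = step(s, a)
    # Dual dynamics: move the multipliers in proportion to the constraint violation.
    violation = thresholds - acc / H
    lam = np.maximum(lam + eta * violation, 0.0)

print("final multipliers:", lam)
```

In this sketch the multipliers act as part of the state seen by the policy, so sampling actions while the dual dynamics run is what realizes the behavior that no fixed linear combination of rewards can induce.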