In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require placing constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process, we establish a benchmark of challenging CRL problems.
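To illustrate the distinction between average and last-iterate convergence referenced above, the following minimal sketch (not the paper's ReLOAD algorithm) compares plain gradient descent-ascent with an optimistic variant on a toy bilinear saddle-point problem. The objective f(x, y) = x * y, the step size `eta`, and the iteration count are arbitrary choices for this illustration.

```python
import numpy as np

# Toy bilinear saddle-point problem: min_x max_y f(x, y) = x * y,
# whose unique saddle point is (0, 0). Plain gradient descent-ascent
# spirals away from the solution (its iterates only converge on average),
# while the optimistic variant converges in the last iterate.

eta = 0.1      # step size (illustrative choice)
steps = 2000   # number of updates (illustrative choice)

def grads(x, y):
    # For f(x, y) = x * y: df/dx = y, df/dy = x.
    return y, x

# Plain gradient descent-ascent.
x, y = 1.0, 1.0
for _ in range(steps):
    gx, gy = grads(x, y)
    x, y = x - eta * gx, y + eta * gy
print("plain GDA last iterate:", (x, y))       # spirals away from (0, 0)

# Optimistic gradient descent-ascent: step with 2 * (current gradient)
# minus the previous gradient, i.e. an extrapolated gradient estimate.
x, y = 1.0, 1.0
prev_gx, prev_gy = grads(x, y)
for _ in range(steps):
    gx, gy = grads(x, y)
    x, y = x - eta * (2 * gx - prev_gx), y + eta * (2 * gy - prev_gy)
    prev_gx, prev_gy = gx, gy
print("optimistic GDA last iterate:", (x, y))  # approaches (0, 0)
```

The optimistic update differs from the plain one only in using the extrapolated gradient, which is the general mechanism by which optimistic methods obtain last-iterate convergence on saddle-point problems.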