Constrained Reinforcement Learning has been employed to enforce safety constraints on the policy through the use of expected cost constraints. The key challenge lies in handling the expected cost accumulated by the policy over an entire trajectory, not just at a single step. Existing methods have developed innovative ways of converting this cost constraint over the entire policy into constraints on local decisions (at each time step). While such approaches provide good solutions with regard to the objective, they can be either overly aggressive or overly conservative with respect to costs, owing to the use of estimates of "future" or "backward" costs in the local cost constraints. To address this, we provide an unconstrained formulation that is equivalent to the constrained RL problem and employs an augmented state space together with reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, it provides a new paradigm for solving constrained RL problems effectively. As our experimental results show, we outperform leading approaches on multiple benchmark problems from the literature.
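The formulation is only described abstractly above. As a rough illustration of the augmented-state-plus-penalty idea, the sketch below wraps a Gymnasium environment so that the cost accumulated so far becomes part of the observation, and any cost incurred beyond a budget is subtracted from the reward. The budget, penalty scale, and the convention that the per-step cost is reported in `info["cost"]` are assumptions made for this sketch, not details taken from the paper.

```python
import gymnasium as gym
import numpy as np


class CostAugmentedWrapper(gym.Wrapper):
    """Illustrative sketch: augment the state with accumulated cost and turn
    the expected-cost constraint into a reward penalty on budget overshoot.
    Assumes a Box observation space and per-step costs in info["cost"]."""

    def __init__(self, env, cost_budget=25.0, penalty_scale=10.0):
        super().__init__(env)
        self.cost_budget = cost_budget      # assumed trajectory cost budget
        self.penalty_scale = penalty_scale  # assumed penalty coefficient
        low = np.append(env.observation_space.low, 0.0)
        high = np.append(env.observation_space.high, np.inf)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)
        self.accumulated_cost = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.accumulated_cost = 0.0
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Penalize only the cost incurred beyond the budget at this step,
        # so violations are charged exactly once.
        prev_overshoot = max(0.0, self.accumulated_cost - self.cost_budget)
        self.accumulated_cost += float(info.get("cost", 0.0))
        new_overshoot = max(0.0, self.accumulated_cost - self.cost_budget)
        reward -= self.penalty_scale * (new_overshoot - prev_overshoot)
        return self._augment(obs), reward, terminated, truncated, info

    def _augment(self, obs):
        # Accumulated cost becomes an extra observation dimension.
        return np.append(obs, self.accumulated_cost).astype(np.float32)
```

Because the accumulated cost is part of the state, a standard unconstrained RL algorithm applied to the wrapped environment can condition its behavior on how much of the budget remains, rather than relying on global estimates of future or backward costs.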