Reinforcement Learning (RL) with constraints is becoming an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted one, yet RL for constrained MDPs (CMDPs) with the average criterion remains challenging: algorithms designed for discounted constrained RL problems often do not perform well in the average CMDP setting. In this paper, we introduce a new (possibly the first) policy optimization algorithm for CMDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by the famed PPO-type algorithms based on trust region methods. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance and, through extensive experimental work in various challenging MuJoCo environments, show its superior performance compared to other state-of-the-art algorithms adapted to the average CMDP setting.