We study constrained reinforcement learning (CRL) from a novel perspective: we set constraints directly on state density functions, rather than on the value functions considered by previous works. State density has a clear physical and mathematical interpretation and can express a wide variety of constraints, such as resource limits and safety requirements. Density constraints also avoid the time-consuming process of designing and tuning the cost functions that value-function-based constraints require to encode system specifications. We leverage the duality between density functions and Q functions to develop an effective algorithm that solves the density-constrained RL problem optimally while guaranteeing that the constraints are satisfied. We prove that the proposed algorithm converges to a near-optimal solution with bounded error even when the policy update is imperfect. A comprehensive set of experiments demonstrates the advantages of our approach over state-of-the-art CRL methods on a wide range of density-constrained tasks as well as standard CRL benchmarks such as Safety-Gym.
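The density constraints described above can be illustrated in the tabular setting, where the problem becomes a linear program over occupancy measures with an explicit per-state density cap. The sketch below is not the paper's algorithm; it is a minimal hypothetical example (a 4-state chain MDP with made-up dynamics, rewards, and a cap on the density of one state) showing how a constraint placed directly on state density, rather than on a value function, shapes the resulting policy.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 4-state chain MDP (illustration only): reward is earned
# in state 3, but a density constraint caps the discounted occupancy
# the agent may place on state 2.
nS, nA, gamma = 4, 2, 0.9
P = np.zeros((nS, nA, nS))
for s in range(nS):
    P[s, 0, max(s - 1, 0)] = 1.0       # action 0: step left
    P[s, 1, min(s + 1, nS - 1)] = 1.0  # action 1: step right
r = np.array([0.0, 0.0, 0.0, 1.0])     # state reward
d0 = np.full(nS, 1.0 / nS)             # uniform initial distribution
rho_max = np.array([1.0, 1.0, 0.05, 1.0])  # density cap on state 2

# Decision variable: occupancy measure mu(s, a), flattened to length nS*nA.
idx = lambda s, a: s * nA + a
c = -np.repeat(r, nA)  # maximize expected reward = minimize -<r, rho>

# Bellman flow constraints:
# sum_a mu(s', a) = (1 - gamma) d0(s') + gamma sum_{s,a} P[s, a, s'] mu(s, a)
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, idx(s, a)] -= gamma * P[s, a, sp]
    for a in range(nA):
        A_eq[sp, idx(sp, a)] += 1.0
b_eq = (1 - gamma) * d0

# Density constraints: rho(s) = sum_a mu(s, a) <= rho_max(s)
A_ub = np.zeros((nS, nS * nA))
for s in range(nS):
    for a in range(nA):
        A_ub[s, idx(s, a)] = 1.0

res = linprog(c, A_ub=A_ub, b_ub=rho_max, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
mu = res.x.reshape(nS, nA)
rho = mu.sum(axis=1)                        # resulting state density
pi = mu / np.maximum(rho[:, None], 1e-12)   # recovered stochastic policy
print("rho =", np.round(rho, 4))
```

The recovered policy is stochastic precisely because the binding density constraint forces the agent to split probability mass, which a value-function-based penalty would only achieve indirectly through cost tuning.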