Constrained reinforcement learning (RL) is an area of RL whose objective is to find an optimal policy that maximizes the expected cumulative return while satisfying a given constraint. Most previous constrained RL works consider the expected cumulative sum cost as the constraint. However, optimization under this constraint cannot guarantee a target probability for the outage event in which the cumulative sum cost exceeds a given threshold. This paper proposes a framework, named Quantile Constrained RL (QCRL), that constrains the quantile of the distribution of the cumulative sum cost, which is a necessary and sufficient condition for satisfying the outage constraint. This is the first work to tackle the issue of applying the policy gradient theorem to the quantile, and it provides theoretical results for approximating the gradient of the quantile. Based on the derived theoretical results and the Lagrange multiplier technique, we construct a constrained RL algorithm named Quantile Constrained Policy Optimization (QCPO). For the implementation of QCPO, we use distributional RL with the Large Deviation Principle (LDP) to estimate the quantile and the tail probability of the cumulative sum cost. The implemented algorithm satisfies the outage probability constraint after the training period.
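To make the Lagrangian-based constraint handling concrete, the following is a minimal sketch of a primal-dual style update on a quantile constraint: the (1 - delta)-quantile of sampled episodic cumulative costs is estimated empirically, and the Lagrange multiplier is adjusted by projected gradient ascent on the constraint violation. All names (threshold d, outage level delta, learning rate, the empirical quantile estimator) are illustrative assumptions for exposition, not the exact formulation, quantile estimator, or update rule used by QCPO.

```python
# Minimal sketch of a Lagrangian-style quantile-constrained dual update.
# Assumed, illustrative names: delta (target outage level), d (cost threshold),
# lr (dual step size). Not the paper's LDP-based estimator or exact algorithm.
import numpy as np


def empirical_quantile(cum_costs, delta):
    """Empirical (1 - delta)-quantile of sampled episodic cumulative costs."""
    return np.quantile(cum_costs, 1.0 - delta)


def dual_update(lam, cum_costs, delta, d, lr=1e-2):
    """Projected gradient ascent on the multiplier:
    lambda <- max(0, lambda + lr * (Q_{1-delta}(C) - d))."""
    q_hat = empirical_quantile(cum_costs, delta)
    return max(0.0, lam + lr * (q_hat - d)), q_hat


# Toy usage with hypothetical episodic cumulative costs from the current policy.
rng = np.random.default_rng(0)
cum_costs = rng.normal(loc=9.0, scale=2.0, size=256)
lam, q_hat = dual_update(lam=0.5, cum_costs=cum_costs, delta=0.1, d=10.0)
print(f"estimated (1-delta)-quantile: {q_hat:.2f}, updated lambda: {lam:.3f}")
# The primal step would then ascend E[return] - lam * (quantile surrogate),
# with the gradient of the quantile handled via the paper's approximation results.
```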