We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, both of which are unknown and must be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for CMDP problems. We derive a logarithmic regret bound, which translates into a $O\!\left(\frac{1}{\Delta\cdot\epsilon}\cdot\log^2(1/\epsilon)\right)$ sample complexity bound, with $\Delta$ being a problem-dependent parameter that is independent of $\epsilon$. Our sample complexity bound improves upon the state-of-the-art $O(1/\epsilon^2)$ sample complexity for CMDP problems established in the previous literature, in terms of the dependency on $\epsilon$. To achieve this advance, we develop a new framework for analyzing CMDP problems. Specifically, our algorithm operates in the primal space, and at each period we resolve the primal LP for the CMDP problem in an online manner, with adaptive remaining resource capacities. The key elements of our algorithm are: i) a characterization of the instance hardness via the LP basis; ii) an eliminating procedure that identifies one optimal basis of the primal LP; and iii) a resolving procedure that is adaptive to the remaining resources and sticks to the characterized optimal basis.
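For concreteness, the following is a minimal sketch of the primal LP referenced above, assuming the standard occupancy-measure formulation of a CMDP; the notation ($q(s,a)$ for the occupancy measure, $r(s,a)$ for the reward, $c_j(s,a)$ for the consumption of resource $j$, $P$ for the unknown transition kernel, and $\rho_j$ for the per-period budget) is illustrative and not fixed by the abstract:
\begin{align*}
\max_{q \ge 0} \quad & \sum_{s,a} r(s,a)\, q(s,a) \\
\text{s.t.} \quad & \sum_{a} q(s',a) = \sum_{s,a} P(s' \mid s,a)\, q(s,a) \quad \forall s' & \text{(flow balance)} \\
& \sum_{s,a} q(s,a) = 1 & \text{(normalization)} \\
& \sum_{s,a} c_j(s,a)\, q(s,a) \le \rho_j \quad \forall j & \text{(resource constraints)}.
\end{align*}
Under this reading, the resolving procedure replaces the right-hand-side budgets $\rho_j$ at each period with the remaining resource capacities and re-solves the LP while restricting attention to the optimal basis identified by the eliminating procedure.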