Many physical systems have underlying safety considerations that require that the policy employed ensure the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy -- in both objective maximization and constraint satisfaction -- in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative model based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that, compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
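For concreteness, the CMDP optimization referred to above can be written in the following standard form; this is a generic sketch with assumed notation (reward $r$, constraint costs $c_i$, thresholds $d_i$, discount factor $\gamma$, initial distribution $\rho$, and $N$ constraints), not the paper's exact statement.

% Standard discounted CMDP: maximize the reward value function while
% keeping each of the N constraint value functions above its threshold.
\begin{align}
  \max_{\pi} \quad & V_r^{\pi}(\rho) \;=\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
  \text{s.t.} \quad & V_{c_i}^{\pi}(\rho) \;=\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \;\ge\; d_i, \qquad i = 1, \dots, N.
\end{align}

In this setting, a PAC guarantee asks for a policy whose reward value is within $\epsilon$ of the constrained optimum and whose constraint values violate each threshold by at most $\epsilon$, with probability at least $1-\delta$.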