In the optimization of dynamic systems, the variables are typically subject to constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), in which the agent chooses a policy that maximizes the total reward over a finite horizon while satisfying the constraints at every epoch with probability 1. We propose a model-free algorithm that converts the PCMDP problem into an unconstrained problem, to which a Q-learning based approach is applied. We extend the concept of probably approximately correct (PAC) to the proposed PCMDP problem. The proposed algorithm is proven to achieve an $(\epsilon,p)$-PAC policy when the number of episodes satisfies $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the numbers of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. To our knowledge, this is the first PAC-type analysis for PCMDPs with peak constraints where the transition dynamics are not known a priori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.
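To make the high-level idea in the abstract concrete, the sketch below shows one simple way a per-epoch (peak) constraint can be folded into the reward so that an ordinary finite-horizon tabular Q-learning loop operates on an unconstrained problem. This is only an illustrative toy, not the paper's algorithm or its PAC-analyzed construction: the environment, the penalty weight `LAMBDA`, the learning rate, and the exploration schedule are all hypothetical choices made here for demonstration.

```python
# Illustrative sketch only: tabular finite-horizon Q-learning where a peak
# (per-epoch) constraint c(s, a) <= 0 is penalized in the reward, turning the
# constrained problem into an unconstrained one. All quantities below
# (environment, LAMBDA, learning rate, schedule) are hypothetical.
import numpy as np

S, A, H, K = 6, 3, 10, 5000       # states, actions, epochs per episode, episodes
LAMBDA = 100.0                     # penalty weight for a constraint violation
rng = np.random.default_rng(0)

# Toy dynamics, rewards, and a single per-epoch constraint c(s, a) <= 0.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] -> next-state distribution
R = rng.uniform(0.0, 1.0, size=(S, A))       # reward in [0, 1]
C = rng.uniform(-0.5, 0.5, size=(S, A))      # constraint function values

Q = np.zeros((H, S, A))                      # finite-horizon Q-table

for k in range(K):
    eps = max(0.05, 1.0 - k / K)             # decaying epsilon-greedy exploration
    s = 0
    for h in range(H):
        a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[h, s]))
        # Peak constraint handled by penalizing any violation at this epoch.
        r = R[s, a] - LAMBDA * max(C[s, a], 0.0)
        s_next = rng.choice(S, p=P[s, a])
        target = r + (np.max(Q[h + 1, s_next]) if h + 1 < H else 0.0)
        Q[h, s, a] += 0.1 * (target - Q[h, s, a])
        s = s_next

# Greedy policy w.r.t. the learned Q; feasibility at each epoch must still be
# checked against the constraint, since the penalty only discourages violations.
policy = Q.argmax(axis=2)
print(policy[0])
```

A penalty-based conversion like this is only one of several ways to remove the constraints (action masking is another); the paper's specific conversion and its $(\epsilon,p)$-PAC guarantee are developed in the sections that follow.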