We consider the problem of constrained Markov Decision Processes (CMDPs), where an agent interacts with a unichain Markov Decision Process. At every interaction, the agent obtains a reward; in addition, there are $K$ cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping each of the $K$ long-term average costs below a given threshold. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm with which the agent can learn optimal policies for interacting with the CMDP. Further, for an MDP with $S$ states, $A$ actions, and diameter $D$, we prove that by following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$. We also show that the violation of any of the $K$ constraints is bounded by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$. To the best of our knowledge, this is the first work to obtain $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
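For concreteness, a minimal sketch of the constrained objective described above, under assumed notation ($r$ for the reward function, $c_k$ for the $k$-th cost function, $C_k$ for its threshold, and $\pi$ for a stationary policy), is
\[
\max_{\pi}\ \liminf_{T\to\infty}\frac{1}{T}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{T} r(s_t,a_t)\right]
\quad \text{s.t.} \quad
\limsup_{T\to\infty}\frac{1}{T}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{T} c_k(s_t,a_t)\right]\le C_k,\qquad k=1,\dots,K,
\]
where the exact formulation used in the body of the paper may differ in details.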