Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives, which are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring that the constraint is met, or at least that the violation is bounded over time. Some recent papers have made progress on this very challenging problem, but they either require unsatisfactory assumptions such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe PSRL (posterior-sampling-based RL) algorithm, which needs no such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by using the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our method is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret, along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$, constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action, horizon-$H$ CMDP.
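To make the description above concrete, the following is a minimal, illustrative sketch of a posterior-sampling primal-dual loop of the kind the abstract describes: sample a transition model from a Dirichlet posterior, plan against a Lagrangian-penalized objective, and update the dual variable from the observed cost. The function names `planner` and `env`, their interfaces, and the step size are hypothetical placeholders, not the paper's exact algorithm.

```python
import numpy as np

def safe_psrl_sketch(num_states, num_actions, H, K, planner, env, dual_lr=0.1):
    """Hypothetical sketch of a posterior-sampling primal-dual loop.

    Assumptions (not from the paper): `planner(P, lam)` returns a policy that
    maximizes the Lagrangian value (reward minus lam times cost) under the
    sampled model P; `env.rollout(pi, H)` returns the visited (s, a, s') triples
    and the episode's total cost; `env.cost_budget` is the constraint threshold.
    """
    # Dirichlet posterior over next-state distributions, one per (s, a) pair.
    alpha = np.ones((num_states, num_actions, num_states))
    lam = 0.0  # dual variable for the cost constraint

    for k in range(K):
        # Posterior sampling: draw one transition model from the posterior.
        P = np.array([[np.random.dirichlet(alpha[s, a])
                       for a in range(num_actions)]
                      for s in range(num_states)])

        # Primal step: plan against the sampled model with the
        # Lagrangian-penalized objective.
        pi = planner(P, lam)

        # Execute the policy for one H-step episode and update the posterior.
        transitions, episode_cost = env.rollout(pi, H)
        for (s, a, s_next) in transitions:
            alpha[s, a, s_next] += 1

        # Dual step: raise lam when the observed cost exceeds the budget,
        # which pushes subsequent plans toward constraint satisfaction.
        lam = max(0.0, lam + dual_lr * (episode_cost - env.cost_budget))

    return lam
```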