We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. In contrast to existing algorithms, we propose two simple algorithms that are statistically more efficient, simpler to implement, and computationally cheaper. The first algorithm is based on a linear formulation of the CMDP, and the second leverages its saddle-point formulation. Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms.
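For reference, the two CMDP formulations mentioned above can be sketched in their standard textbook forms; these are assumptions about the general setup (reward r, cost c, constraint threshold tau), not necessarily the exact objectives used in the paper. The linear formulation optimizes over occupancy measures d(s,a), while the saddle-point formulation introduces a Lagrange multiplier lambda for the cost constraint:

% Linear (occupancy-measure) formulation; D denotes the set of valid occupancy measures
\max_{d \in \mathcal{D}} \ \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad \sum_{s,a} d(s,a)\, c(s,a) \le \tau .

% Saddle-point (Lagrangian) formulation over policies \pi and multiplier \lambda
\max_{\pi} \ \min_{\lambda \ge 0} \ V_r^{\pi} - \lambda \bigl( V_c^{\pi} - \tau \bigr).

A posterior sampling scheme built on either form would, at each episode, sample a model from the posterior over transition dynamics and solve the corresponding sampled objective to obtain the policy to execute.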