We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem, modeled as an unknown MDP with an absorbing goal state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution over the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes when either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to some state-action pair is doubled. We establish a Bayesian regret bound of $O(B_\star S\sqrt{AK})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm requires only knowledge of the prior distribution and has no hyper-parameters to tune. It is the first posterior sampling algorithm for the SSP problem and numerically outperforms previously proposed optimism-based algorithms.
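To make the epoch schedule concrete, the following is a minimal sketch of the PSRL-SSP loop under illustrative assumptions that are not taken from the paper: a small tabular SSP with known costs, an independent Dirichlet prior on the transition distribution of every state-action pair, and a plain value-iteration planner. The function names (`plan_ssp`, `psrl_ssp`), the iteration caps, and the treatment of unvisited state-action pairs in the doubling rule are all simplifications for illustration.

```python
import numpy as np

def plan_ssp(P, cost, goal, n_iter=5000, tol=1e-6):
    """Value iteration on a sampled SSP model; returns a greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = cost + P @ V          # Q[s, a] = c(s, a) + E_{s' ~ P(.|s, a)}[V(s')]
        Q[goal, :] = 0.0          # the goal state is absorbing and cost-free
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmin(axis=1)

def psrl_ssp(P_true, cost, goal, s_init, K, rng):
    """Run K episodes of PSRL-SSP: draw one model sample per epoch."""
    S, A, _ = P_true.shape
    alpha = np.ones((S, A, S))    # Dirichlet posterior parameters (uniform prior)
    counts = np.zeros((S, A))     # lifetime visit counts N(s, a)
    done_episodes, prev_goal_visits, s = 0, 0, s_init
    while done_episodes < K:
        # Epoch start: sample transition dynamics from the posterior and plan.
        P_sample = np.array([[rng.dirichlet(alpha[x, a]) for a in range(A)]
                             for x in range(S)])
        policy = plan_ssp(P_sample, cost, goal)
        # Counts at the start of the epoch (unvisited pairs treated as 1,
        # a simplification of the doubling criterion).
        start_counts = np.maximum(counts, 1)
        goal_visits = 0
        while True:
            a = policy[s]
            s_next = rng.choice(S, p=P_true[s, a])
            alpha[s, a, s_next] += 1   # conjugate Dirichlet posterior update
            counts[s, a] += 1
            if s_next == goal:         # episode ends upon reaching the goal
                goal_visits += 1
                done_episodes += 1
                s_next = s_init        # start the next episode
            s = s_next
            # Epoch ends when the goal visits exceed last epoch's total, or
            # some state-action visit count has doubled.
            if done_episodes >= K or goal_visits > prev_goal_visits \
                    or np.any(counts >= 2 * start_counts):
                prev_goal_visits = goal_visits
                break
    return counts
```

With a Dirichlet prior the posterior update reduces to a count increment, and the two stopping conditions keep the number of epochs (and hence policy switches) small, while requiring no hyper-parameters beyond the prior itself.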