We consider the trade-off between exploration and exploitation in finite discounted Markov Decision Processes, where the state transition matrix of the underlying environment is unknown. We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem. The algorithm achieves a total regret bound of $\tilde{\mathcal{O}}(D\sqrt{SAT})$\footnote{The symbol $\tilde{\mathcal{O}}$ denotes $\mathcal{O}$ with logarithmic factors ignored.} over a time horizon $T$ with $S$ states, $A$ actions, and diameter $D$. DTS consists of two parts. The first part is the traditional one, in which we apply posterior sampling to the transition matrix based on a prior distribution. In the second part, we employ a count-based posterior update to balance the locally optimal action against the long-term optimal action in order to find the globally optimal value. We establish a regret bound of $\tilde{\mathcal{O}}(\sqrt{T}/S^{2})$, which is, to our knowledge, the best regret bound for finite discounted Markov Decision Processes to date. Numerical results demonstrate the efficiency and superiority of our approach.
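To illustrate the first, traditional part, the sketch below shows one round of posterior sampling on the transition matrix under the common assumption of an independent Dirichlet prior over each row. The toy sizes, the assumed-known reward table, and the helper names (sample_transitions, value_iteration) are illustrative placeholders rather than the paper's notation or the full DTS procedure.
\begin{verbatim}
import numpy as np

# Minimal sketch: posterior sampling over an unknown transition matrix,
# assuming an independent Dirichlet prior on each (state, action) row.
S, A = 5, 3          # toy numbers of states and actions
gamma = 0.95         # discount factor

counts = np.ones((S, A, S))      # Dirichlet pseudo-counts (uniform prior)
reward = np.random.rand(S, A)    # rewards assumed known in this toy example

def sample_transitions(counts):
    """Draw a plausible transition matrix from the Dirichlet posterior."""
    P = np.zeros_like(counts)
    for s in range(S):
        for a in range(A):
            P[s, a] = np.random.dirichlet(counts[s, a])
    return P

def value_iteration(P, R, gamma, iters=500):
    """Solve the sampled MDP and return a greedy policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V    # Q has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

policy = value_iteration(sample_transitions(counts), reward, gamma)
# After acting with `policy` and observing a transition (s, a, s'),
# the posterior is updated by incrementing the matching pseudo-count:
# counts[s, a, s_next] += 1
\end{verbatim}
Acting greedily with respect to a sampled model, then updating the posterior from observed transition counts, is the standard posterior-sampling loop that the first part of DTS builds on; the paper's second, count-based component is not reproduced here.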