We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty} principle and a Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor, and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
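Here, regret is measured in the way that is standard for discounted MDPs (stated for context under that common convention, rather than quoted from the paper): the cumulative gap between the optimal value and the value of the policy being executed, evaluated at each visited state,
\[
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \bigl( V^{*}(s_t) - V^{\pi_t}(s_t) \bigr),
\]
where $s_t$ is the state visited at step $t$ and $\pi_t$ is the policy the algorithm executes at that step.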