This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S$, $A$, and $T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
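For concreteness, the per-step update of the optimistic $Q$-learning (UCB-Hoeffding) algorithm of [Jin et al. 2018] analyzed here can be sketched as follows; the constant $c$ and the confidence term $\iota$ below are placeholders of the right order rather than the exact values used in the proofs:
\[
Q_h(s_h, a_h) \leftarrow (1-\alpha_t)\, Q_h(s_h, a_h) + \alpha_t\left[r_h(s_h, a_h) + V_{h+1}(s_{h+1}) + b_t\right],
\qquad \alpha_t = \frac{H+1}{H+t},
\]
\[
b_t = c\sqrt{\frac{H^3 \iota}{t}},
\qquad V_h(s) = \min\Big\{H,\; \max_{a} Q_h(s, a)\Big\},
\]
where $t$ is the number of times $(s_h, a_h)$ has been visited at step $h$ and $\iota$ is a logarithmic confidence term of order $\log\left(SAT\right)$.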