We study algorithms that use randomized value functions for exploration in reinforcement learning. This class of algorithms enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $T$ is the number of interactions. This bound polynomially improves upon all existing bounds for algorithms based on randomized value functions and, for the first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure that both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
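To make the two algorithmic ingredients above concrete, the following is a minimal, hypothetical sketch of randomized value iteration for a tabular episodic MDP: a single Gaussian seed is drawn per episode and reused at every $(h, s, a)$, the noise magnitude follows a Bernstein-style rule (a variance-dependent term plus a lower-order range term), and a clipping step keeps the perturbed values in a feasible range. This is not the paper's exact algorithm or its specific clipping operator; all function names, array shapes, and constants (e.g. `noise_scale`, the placeholder bonus constants) are illustrative assumptions.

```python
import numpy as np

def randomized_value_iteration(counts, rewards_hat, trans_hat, H, S, A,
                               noise_scale=1.0, rng=None):
    """Return perturbed Q-values for one episode (illustrative sketch).

    counts[h, s, a]      -- visit counts N_h(s, a), shape (H + 1, S, A)
    rewards_hat[h, s, a] -- empirical mean rewards, shape (H + 1, S, A)
    trans_hat[h, s, a]   -- empirical next-state distribution, shape (H + 1, S, A, S)
    """
    rng = rng or np.random.default_rng()
    xi = rng.standard_normal()            # single random seed for the whole episode
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 2, S))              # V[H + 1] = 0 terminal value
    for h in range(H, 0, -1):
        for s in range(S):
            for a in range(A):
                n = max(counts[h, s, a], 1)
                ev = trans_hat[h, s, a] @ V[h + 1]
                # Bernstein-style magnitude: empirical-variance term plus a
                # lower-order H / n term (constants are placeholders).
                var = trans_hat[h, s, a] @ (V[h + 1] - ev) ** 2
                sigma = noise_scale * (np.sqrt(var / n) + H / n)
                q = rewards_hat[h, s, a] + ev + sigma * xi
                # Simple range clipping: estimates stay in [0, H - h + 1] so
                # the perturbation cannot leave the feasible value range.
                Q[h, s, a] = np.clip(q, 0.0, H - h + 1)
            V[h, s] = Q[h, s].max()
    return Q
```

Because the same seed `xi` multiplies every noise term in the episode, the perturbed values are either uniformly inflated or uniformly deflated, which is what lets one lower bound both the probability of optimism and the probability of pessimism by constants; the paper's actual clipping operation serves that purpose, whereas the range clipping above is only a stand-in.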