This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select, among the admissible actions that are not dominated by any other action, the action with the largest confidence interval length. We show that when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that scales only with the sum of the inverses of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed that all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{\min}}\right)$ regret due to over-exploration, where $\Delta_{\min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{\min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing that the dependency on $\frac{|Z_{mul}|}{\Delta_{\min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.
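To make the second innovation concrete, the following is a minimal sketch (not the paper's pseudocode) of the described action-selection rule: keep only actions that are not dominated by any other action, then pick the one with the widest confidence interval. The array names `q_upper` and `q_lower`, denoting per-action upper and lower confidence bounds on the optimal $Q$-value at the current state, are illustrative assumptions.

```python
import numpy as np

def select_action(q_upper: np.ndarray, q_lower: np.ndarray) -> int:
    """Sketch of the non-dominated, widest-interval action-selection rule.

    q_upper, q_lower: 1-D arrays of per-action upper/lower confidence
    bounds on the optimal Q-value at the current state (hypothetical names).
    """
    # An action is admissible (not dominated) if its upper bound is at
    # least the best lower bound achieved by any action.
    admissible = q_upper >= np.max(q_lower)
    # Confidence interval length for each action.
    widths = q_upper - q_lower
    # Among admissible actions, choose the one with the widest interval.
    widths = np.where(admissible, widths, -np.inf)
    return int(np.argmax(widths))
```

Under this sketch, a clearly sub-optimal action (upper bound below some other action's lower bound) is never played, while uncertainty among the remaining candidates drives exploration toward the least-resolved action.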