Various algorithms in reinforcement learning exhibit dramatic variability in their convergence rates and ultimate accuracy as a function of the problem structure. Such instance-specific behavior is not captured by existing global minimax bounds, which are worst-case in nature. We analyze the problem of estimating optimal $Q$-value functions for a discounted Markov decision process with discrete states and actions and identify an instance-dependent functional that controls the difficulty of estimation in the $\ell_\infty$-norm. Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy of any estimation procedure. In the other direction, we establish the sharpness of our lower bounds, up to factors logarithmic in the sizes of the state and action spaces, by analyzing a variance-reduced version of $Q$-learning. Our theory provides a precise way of distinguishing "easy" problems from "hard" ones in the context of $Q$-learning, as illustrated by an ensemble with a continuum of difficulty.