We derive a novel asymptotic problem-dependent lower bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, as in prior work (e.g., for ergodic MDPs), the lower bound is expressed as the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over state-action pairs that explicitly accounts for the dynamics of the MDP. We characterize our lower bound through a series of examples illustrating how different MDPs may have significantly different complexity. 1) We first consider a "difficult" MDP instance, where the novel constraint based on the dynamics leads to a larger lower bound (i.e., larger regret) compared to the classical analysis. 2) We then show that our lower bound recovers results previously derived for specific MDP instances. 3) Finally, we show that, in certain "simple" MDPs, the lower bound is considerably smaller than in the general case and does not scale with the minimum action gap at all. We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
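To fix intuition, the display below is only a schematic of the generic form that asymptotic problem-dependent regret lower bounds take (in the Graves–Lai tradition); the symbols $\eta$ (asymptotic visitation allocation), $\Delta(s,a)$ (action gaps), and $K$ (number of episodes) are illustrative notation, not the paper's exact formulation, and the precise constraint sets in the paper differ:
$$
\liminf_{K\to\infty}\;\frac{\mathbb{E}\left[\mathrm{Regret}(K)\right]}{\log K}
\;\ge\;
\inf_{\eta\ge 0}\;\sum_{s,a}\eta(s,a)\,\Delta(s,a)
\quad\text{s.t.}\quad
\underbrace{\eta\ \text{satisfies KL-based identification constraints}}_{\text{classical}},
\qquad
\underbrace{\eta\ \text{is realizable as a visitation distribution under the MDP dynamics}}_{\text{additional constraint}}.
$$
The second constraint is the schematic counterpart of the dynamics-dependent restriction discussed above: the allocation $\eta$ cannot place visitations arbitrarily, but only in proportions reachable by some behavior in the MDP.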