We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study of model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.