We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameters (in contrast to previous minimax approaches) and thus truly quantify the fundamental limit on the performance achievable by any learning algorithm. Deterministic MDPs can be interpreted as graphs and analyzed in terms of their cycles, a fact we leverage to identify a class of deterministic MDPs whose regret lower bound can be determined numerically. We further illustrate this result on a deterministic line-search problem and a deterministic MDP with state-dependent rewards, whose regret lower bounds we state explicitly. These bounds share similarities with the known problem-specific bound for the multi-armed bandit problem and suggest that navigation in a deterministic MDP need not have an effect on the performance of a learning algorithm.
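For context, the problem-specific bandit bound referred to above is the classical Lai–Robbins asymptotic lower bound. A standard statement (recalled here for reference, not taken from this abstract) for a $K$-armed bandit with reward distributions $\nu_a$, means $\mu_a$, and optimal mean $\mu^* = \max_a \mu_a$ is:

```latex
% Lai–Robbins lower bound: any uniformly good learning algorithm
% incurs expected regret R(T) satisfying
\liminf_{T \to \infty} \frac{R(T)}{\log T}
  \;\ge\; \sum_{a\,:\,\mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathrm{KL}(\nu_a, \nu^*)},
% where KL(nu_a, nu^*) is the Kullback–Leibler divergence between the
% reward distribution of arm a and that of an optimal arm.
```

The lower bounds derived in the paper take a comparable form, with suboptimality gaps weighted by inverse KL divergences over the relevant deterministic-MDP structure (cycles rather than arms).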