We study the fundamental question of the sample complexity of learning a good policy in finite Markov decision processes (MDPs) when the data available for learning is obtained by following a logging policy that must be chosen without knowledge of the underlying MDP. Our main results show that the sample complexity, the minimum number of transitions necessary and sufficient to obtain a good policy, is an exponential function of the relevant quantities when the planning horizon $H$ is finite. In particular, we prove that the sample complexity of obtaining $\epsilon$-optimal policies is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H+1)})$ for $\gamma$-discounted problems, where $\mathrm{S}$ is the number of states, $\mathrm{A}$ is the number of actions, and $H$ is the effective horizon defined as $H=\lfloor \tfrac{\ln(1/\epsilon)}{\ln(1/\gamma)} \rfloor$; and it is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\epsilon^2)$ for finite horizon problems, where $H$ is the planning horizon of the problem. This lower bound is essentially matched by an upper bound. For the average-reward setting we show that there is no algorithm finding $\epsilon$-optimal policies with a finite amount of data.
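As a hypothetical numerical illustration of the discounted-case bound (the values $\gamma=0.9$ and $\epsilon=0.1$ are chosen purely for exposition and do not come from the paper): the effective horizon is
\[
H \;=\; \left\lfloor \frac{\ln(1/\epsilon)}{\ln(1/\gamma)} \right\rfloor
\;=\; \left\lfloor \frac{\ln 10}{\ln(10/9)} \right\rfloor
\;\approx\; \left\lfloor \frac{2.3026}{0.1054} \right\rfloor
\;=\; 21,
\]
so for an MDP with $\mathrm{A}=2$ actions and $\mathrm{S}\ge 23$ states the lower bound already requires on the order of $\mathrm{A}^{H+1} = 2^{22} \approx 4.2\times 10^{6}$ transitions.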