The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, obtaining a nearly optimal policy with sampling access to a generative model requires a minimax optimal sample complexity that scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $|\mathcal{S}|$ or $|\mathcal{A}|$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp.~Q-learning) provably learns an $\varepsilon$-optimal policy (resp.~Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp.~$\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient as long as $K$ is relatively small, hence the title of this paper.
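As a quick illustration of how the two bounds scale, the following minimal Python sketch evaluates the order-level sample sizes stated above, ignoring logarithmic factors and absolute constants. The values of $K$, $\gamma$, and $\varepsilon$ used here are hypothetical and not taken from the paper.

```python
# Illustrative only: order-level sample sizes implied by the two bounds,
# ignoring logarithmic factors and absolute constants.

def model_based_samples(K: int, gamma: float, eps: float) -> float:
    """Order of the sample size for the model-based approach: K / ((1 - gamma)^3 * eps^2)."""
    return K / ((1.0 - gamma) ** 3 * eps ** 2)

def q_learning_samples(K: int, gamma: float, eps: float) -> float:
    """Order of the sample size for Q-learning: K / ((1 - gamma)^4 * eps^2)."""
    return K / ((1.0 - gamma) ** 4 * eps ** 2)

if __name__ == "__main__":
    # Hypothetical feature dimension, discount factor, and target accuracy.
    K, gamma, eps = 100, 0.99, 0.1
    print(f"model-based ~ {model_based_samples(K, gamma, eps):.2e} samples")
    print(f"Q-learning  ~ {q_learning_samples(K, gamma, eps):.2e} samples")
```

Note that both quantities depend on the feature dimension $K$ rather than on $|\mathcal{S}|\times|\mathcal{A}|$, and the extra $\frac{1}{1-\gamma}$ factor makes the Q-learning bound larger by a factor of $100$ for the discount factor chosen here.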