A model-based approach is widely believed to be a key to reducing the sample complexity of reinforcement learning (RL). However, the sample optimality of model-based RL is still largely unresolved, even in the linear case. This work considers the sample complexity of finding an $\epsilon$-optimal policy in a Markov decision process (MDP) that admits a linear additive feature representation, given only access to a generative model. We solve this problem via a plug-in solver approach, which builds an empirical model from samples and then plans in this empirical model via an arbitrary plug-in solver. We prove that under the anchor-state assumption, which implies implicit non-negativity in the feature space, the minimax sample complexity of finding an $\epsilon$-optimal policy in a $\gamma$-discounted MDP is $O\!\left(K/\big((1-\gamma)^3\epsilon^2\big)\right)$, which depends only on the dimension $K$ of the feature space and has no dependence on the size of the state or action space. We further extend our results to a relaxed setting where anchor states may not exist, and show that a plug-in approach can still be sample efficient, providing a flexible framework for designing model-based RL algorithms.
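To make the plug-in solver approach concrete, the following is a minimal sketch under simplifying assumptions: a small tabular MDP stands in for the generative model (rather than the paper's linear additive feature representation), the reward function is assumed known, and value iteration plays the role of the arbitrary plug-in planner. The function names (`build_empirical_mdp`, `plan_value_iteration`) are hypothetical and illustrative only, not the paper's implementation.

```python
# Sketch of a generative-model plug-in approach: sample transitions for each
# (state, action) pair, build an empirical MDP, then plan in it with any solver.
import numpy as np

def build_empirical_mdp(true_P, n_samples, rng):
    """Estimate transition probabilities by drawing n_samples next-states
    per (s, a) pair from a generative model (simulated here by true_P)."""
    S, A = true_P.shape[0], true_P.shape[1]
    P_hat = np.zeros_like(true_P)
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n_samples, p=true_P[s, a])
            P_hat[s, a] = np.bincount(next_states, minlength=S) / n_samples
    return P_hat

def plan_value_iteration(P, R, gamma, n_iters=1000):
    """Arbitrary plug-in planner; here, standard value iteration."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * P @ V      # shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V     # greedy policy and its value estimate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9
    true_P = rng.dirichlet(np.ones(S), size=(S, A))   # true (unknown) dynamics
    R = rng.uniform(size=(S, A))                      # known reward function
    P_hat = build_empirical_mdp(true_P, n_samples=2000, rng=rng)
    policy, V_hat = plan_value_iteration(P_hat, R, gamma)
    print("Greedy policy from the empirical model:", policy)
```

The point of the sketch is the separation the abstract describes: the only interaction with the environment is the sampling step that builds the empirical model, after which planning is an entirely offline computation that can be delegated to any solver.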