In this work, we study the sample complexity of model-free reinforcement learning with a generative model. Specifically, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses Kullback-Leibler (KL) divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small. This is the first theoretical result demonstrating that a simple model-free algorithm without variance reduction can be nearly minimax-optimal in the considered setting.
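For intuition, the display below is a generic sketch of a KL- and entropy-regularized policy update of the kind referred to above; it is not reproduced from the cited papers, and the regularization coefficients $\tau$ (KL) and $\kappa$ (entropy) are illustrative. Given action values $q_k$ and the previous policy $\pi_k$, such an update can be written as
$$
\pi_{k+1}(\cdot \mid s) \;=\; \operatorname*{argmax}_{\pi \in \Delta_{\mathcal{A}}} \;\bigl\langle q_k(s,\cdot), \pi \bigr\rangle \;-\; \tau\, \mathrm{KL}\bigl(\pi \,\|\, \pi_k(\cdot \mid s)\bigr) \;+\; \kappa\, \mathcal{H}(\pi),
$$
which admits the closed form
$$
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)^{\frac{\tau}{\tau+\kappa}} \exp\!\Bigl(\tfrac{q_k(s,a)}{\tau+\kappa}\Bigr).
$$
In words, the new policy interpolates between the previous policy (through the KL term) and a softmax of the current action values (through the entropy term), which is the averaging effect that KL-regularized schemes exploit.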