At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator, called the loop estimator, that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$, unlike TD with $O(S)$ or model-based methods with $O\left(S^2\right)$. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\widetilde{O}\left(\sqrt{\tau_s/T}\right)$ over $T$ steps on a single sample path, where $\tau_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
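To make the regenerative idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of a loop-based estimate of the discounted value of a single recurrent state $s$. It assumes a sample path decomposed into loops that start and end at $s$; by the Markov property, $V(s) = \mathbb{E}[U] / (1 - \mathbb{E}[\gamma^{\tau}])$, where $U$ is the discounted reward accumulated within one loop and $\tau$ is the loop length. The function name and argument layout are hypothetical; only running averages are kept, so the space usage is $O(1)$ regardless of the number of states.

```python
def loop_estimate(states, rewards, s, gamma):
    """Estimate the discounted value V(s) from one sample path.

    states[t] is the state at step t; rewards[t] is the reward received
    upon leaving states[t]. The path is split into "loops" from s back
    to s, and V(s) is estimated as mean(U) / (1 - mean(gamma**tau)),
    where U is the discounted reward inside a loop of length tau.
    Only O(1) running statistics are maintained.
    """
    sum_u = 0.0    # running sum of per-loop discounted rewards U
    sum_g = 0.0    # running sum of per-loop discount factors gamma**tau
    n_loops = 0    # number of completed loops
    u = 0.0        # discounted reward accumulated in the current loop
    disc = 1.0     # gamma**(steps elapsed in the current loop)
    in_loop = False

    for x, r in zip(states, rewards):
        if x == s:
            if in_loop:  # a loop s -> ... -> s just closed
                sum_u += u
                sum_g += disc
                n_loops += 1
            u, disc, in_loop = 0.0, 1.0, True  # start a fresh loop
        if in_loop:
            u += disc * r
            disc *= gamma

    if n_loops == 0:
        return None  # s was visited at most once; no loop observed
    mean_u = sum_u / n_loops
    mean_g = sum_g / n_loops
    return mean_u / (1.0 - mean_g)
```

For example, on a path that stays in $s$ forever with reward 1 per step, every loop has $\tau = 1$ and $U = 1$, so the estimate is $1/(1-\gamma)$, the exact discounted value.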