Loop Estimator for Discounted Values in Markov Reward Processes

At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of a state from samples of a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator, called the \emph{loop estimator}, that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method has a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$, unlike TD, which requires $O(S)$, or model-based methods, which require $O\left(S^2\right)$. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\widetilde{O}\left(\sqrt{\tau_s/T}\right)$ in the number of steps $T$ on a single sample path, where $\tau_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
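The abstract does not state the estimator's formula, so the following is only a minimal sketch of one way the regenerative structure could be used, not necessarily the authors' exact method. It assumes that successive visits to $s$ split the sample path into i.i.d. loops, so that $v(s) = \alpha + \beta\, v(s)$, where $\alpha$ is the expected discounted reward collected within a loop and $\beta$ is the expected value of $\gamma^{\text{loop length}}$; the plug-in estimate is then $\hat{\alpha} / (1 - \hat{\beta})$. The function name `loop_estimate`, the `(state, reward)` input format, and the convention that a reward is received at the step on which its state is visited are all hypothetical choices made for illustration.

```python
def loop_estimate(path, s, gamma):
    """Estimate the discounted value of state `s` from a single sample path.

    `path` is an iterable of (state, reward) pairs (an assumed format).
    Only a fixed number of scalars is stored, regardless of the number
    of states or the path length. Assumes 0 <= gamma < 1.
    Returns None if no complete loop through `s` was observed.
    """
    n_loops = 0      # number of completed loops s -> ... -> s
    sum_return = 0.0  # sum over loops of the discounted reward within the loop
    sum_disc = 0.0    # sum over loops of gamma ** (loop length)
    in_loop = False   # becomes True at the first visit to s
    g = 0.0           # discounted reward accumulated in the current loop
    disc = 1.0        # gamma ** (steps since the current loop started)

    for state, reward in path:
        if state == s:
            if in_loop:
                # The current loop has just closed: record its statistics.
                n_loops += 1
                sum_return += g
                sum_disc += disc
            # A new loop starts at every visit to s.
            in_loop = True
            g, disc = 0.0, 1.0
        if in_loop:
            g += disc * reward
            disc *= gamma

    if n_loops == 0:
        return None
    alpha_hat = sum_return / n_loops
    beta_hat = sum_disc / n_loops
    # Solve v = alpha + beta * v for v, using the empirical averages.
    return alpha_hat / (1.0 - beta_hat)


if __name__ == "__main__":
    # Deterministic two-state cycle 0 -> 1 -> 0 -> ...,
    # with reward 1 in state 0 and reward 0 in state 1.
    demo_path = [(0, 1.0), (1, 0.0)] * 100 + [(0, 1.0)]
    print(loop_estimate(demo_path, s=0, gamma=0.5))
```

In the demo chain, every loop has length 2 and collects a discounted reward of 1, so the exact value solves $v = 1 + 0.25\,v$, giving $v = 4/3$. The sketch keeps only running sums and the current loop's accumulators, which is what allows constant memory for a single target state under the assumptions above.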
