The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is largely unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to perform it optimally. In this paper, we present a minimax lower bound for the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process enjoys compelling statistical properties compared to the alternative estimators, as it matches the lower bound without requiring careful tuning of the episode horizon.
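To make the setting concrete, the following is a minimal formal sketch of the estimation problem and of the two sampling schemes the abstract contrasts; the notation ($\rho_\gamma$, $f$, $\widehat{\mu}$, $\mu_0$, $P$) is ours for illustration, and the estimators analyzed in the paper may differ in their exact form. Given a Markov reward process with initial distribution $\mu_0$, transition kernel $P$, and discount factor $\gamma \in (0,1)$, the target is the mean of a function $f$ under the $\gamma$-discounted state distribution:
\[
\rho_\gamma \;=\; (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, \mu_0 P^{t},
\qquad
\mu_f \;=\; \mathbb{E}_{s \sim \rho_\gamma}\!\big[f(s)\big].
\]
A finite-horizon estimator truncates each of $n$ episodes at a fixed horizon $T$,
\[
\widehat{\mu}^{\,T}_{n} \;=\; \frac{1-\gamma}{1-\gamma^{T}} \cdot \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\gamma^{t} f\big(s^{(i)}_{t}\big),
\]
and thus carries a truncation bias that depends on how $T$ compares with the (typically unknown) mixing time of the process. Sampling from the discounted kernel instead draws an independent horizon $T_i \sim \mathrm{Geometric}(1-\gamma)$ for each episode and averages the states reached at those random times,
\[
\widehat{\mu}^{\,\gamma}_{n} \;=\; \frac{1}{n}\sum_{i=1}^{n} f\big(s^{(i)}_{T_i}\big),
\qquad s^{(i)}_{T_i} \sim \rho_\gamma,
\]
which is unbiased for $\mu_f$ by construction and requires no choice of episode horizon.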