The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study on the pitfalls of episodic sampling and on how to perform it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process enjoys compelling statistical properties w.r.t. the alternative estimators, as it matches the lower bound without requiring a careful tuning of the episode horizon.
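To make the two sampling schemes concrete, the following is a minimal sketch, not the paper's implementation, contrasting a truncated finite-horizon estimator with an estimator that samples directly from the discounted kernel, i.e., it stops the chain at each step with probability $1-\gamma$ so that the stopping time is geometric and the stopped state follows the discounted state distribution. The toy transition matrix `P`, function `f`, discount `gamma`, initial distribution `mu0`, and the helper names are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state Markov chain: P[i, j] = probability of moving from state i to j.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
f = np.array([1.0, 0.0, -1.0])    # function whose discounted mean we estimate
gamma = 0.95                      # discount factor
mu0 = np.array([1.0, 0.0, 0.0])   # initial-state distribution


def finite_horizon_estimate(n_episodes: int, horizon: int) -> float:
    """Truncated finite-horizon estimator: average the (1 - gamma)-normalized
    discounted sum of f over episodes of a fixed, hand-tuned horizon."""
    total = 0.0
    for _ in range(n_episodes):
        s = rng.choice(len(mu0), p=mu0)
        disc_sum = 0.0
        for t in range(horizon):
            disc_sum += (gamma ** t) * f[s]
            s = rng.choice(P.shape[1], p=P[s])
        total += (1.0 - gamma) * disc_sum
    return total / n_episodes


def discounted_kernel_estimate(n_samples: int) -> float:
    """Discounted-kernel estimator: continue the chain with probability gamma
    at every step, so the stopped state is a draw from the discounted state
    distribution; no episode horizon needs to be tuned."""
    total = 0.0
    for _ in range(n_samples):
        s = rng.choice(len(mu0), p=mu0)
        while rng.random() < gamma:           # continue with probability gamma
            s = rng.choice(P.shape[1], p=P[s])
        total += f[s]
    return total / n_samples


if __name__ == "__main__":
    # Exact discounted mean for comparison: (1 - gamma) * mu0^T (I - gamma P)^{-1} f.
    exact = (1.0 - gamma) * mu0 @ np.linalg.solve(np.eye(3) - gamma * P, f)
    print("exact:            ", exact)
    print("finite-horizon:   ", finite_horizon_estimate(2000, horizon=100))
    print("discounted kernel:", discounted_kernel_estimate(2000))
```

In this sketch, the finite-horizon estimator carries a truncation bias governed by the choice of `horizon`, while the discounted-kernel estimator is unbiased for the discounted mean by construction, which is the qualitative distinction the abstract refers to.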