Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are less understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on fitted Q-evaluation (FQE), which is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational cost of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for estimating confidence intervals, the variance of an off-policy evaluator, and the correlation between multiple off-policy evaluators.
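To make the procedure concrete, the following is a minimal sketch, assuming a small synthetic MDP and a tabular FQE regressor; the MDP sizes, the target policy, the replicate counts, and the subsample size m are illustrative choices, not the paper's experimental setup. Each bootstrap replicate resamples whole episodes with replacement and reruns FQE, so the spread of the replicates approximates the distribution of the policy evaluation error; the subsampled variant refits on m < n episodes per replicate for the runtime savings.

```python
# A minimal sketch, not the paper's implementation: bootstrapping tabular FQE
# on a synthetic MDP. The MDP sizes, target policy, replicate counts, and the
# subsample size `m` are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, horizon = 5, 2, 0.9, 20
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel P[s, a]
R = rng.uniform(size=(S, A))                 # deterministic rewards R[s, a]

def rollout(n_episodes):
    """Collect episodes under a uniform-random behavior policy."""
    episodes = []
    for _ in range(n_episodes):
        s, traj = 0, []
        for _ in range(horizon):
            a = int(rng.integers(A))
            s2 = int(rng.choice(S, p=P[s, a]))
            traj.append((s, a, R[s, a], s2))
            s = s2
        episodes.append(traj)
    return episodes

def fqe(episodes, pi, n_iter=50):
    """Tabular FQE: repeatedly regress r + gamma * Q(s', pi(s')) onto (s, a)."""
    flat = [t for ep in episodes for t in ep]
    s = np.array([t[0] for t in flat]); a = np.array([t[1] for t in flat])
    r = np.array([t[2] for t in flat]); s2 = np.array([t[3] for t in flat])
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        y = r + gamma * Q[s2, pi[s2]]               # regression targets
        num, cnt = np.zeros((S, A)), np.zeros((S, A))
        np.add.at(num, (s, a), y)
        np.add.at(cnt, (s, a), 1.0)
        Q = num / np.maximum(cnt, 1.0)              # tabular least-squares fit
    return Q[0, pi[0]]                              # value of pi at s0 = 0

pi = np.zeros(S, dtype=int)                         # deterministic target policy
episodes = rollout(200)
n = len(episodes)
point = fqe(episodes, pi)

# Full bootstrap: resample episodes (not transitions) with replacement, rerun
# FQE, and read CIs / variance off the empirical distribution of estimates.
boot = [fqe([episodes[i] for i in rng.integers(n, size=n)], pi)
        for _ in range(200)]
lo, hi = np.percentile(boot, [2.5, 97.5])

# Subsampled bootstrap (the runtime speedup): each replicate refits FQE on
# only m < n episodes; deviations are rescaled by sqrt(m / n).
m = 40
sub = [fqe([episodes[i] for i in rng.integers(n, size=m)], pi)
       for _ in range(200)]
sub_var = (m / n) * np.var(sub)

print(f"FQE estimate {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f}), "
      f"bootstrap var {np.var(boot):.4f}, subsampled var {sub_var:.4f}")
```

Resampling at the episode level rather than the transition level preserves the within-trajectory dependence of the data, which is what makes the resulting error distribution a sensible target for off-policy inference.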