We investigate statistical uncertainty quantification for reinforcement learning (RL) and its implications for exploration policies. Despite the ever-growing literature on RL applications, fundamental questions about inference and error quantification, such as large-sample behaviors, remain largely open. In this paper, we fill this gap in the literature by studying the central-limit-theorem behaviors of estimated Q-values and value functions under various RL settings. In particular, we explicitly identify closed-form expressions for the asymptotic variances, which allow us to efficiently construct asymptotically valid confidence regions for key RL quantities. Furthermore, we use these asymptotic expressions to design an effective exploration strategy, which we call Q-value-based Optimal Computing Budget Allocation (Q-OCBA). The policy relies on maximizing the relative discrepancies among the Q-value estimates. Numerical experiments demonstrate the superior performance of our exploration strategy over benchmark policies.
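As a minimal illustration of the confidence-region construction described above (the symbols $\hat{Q}_n(s,a)$, $\sigma^2(s,a)$, and the sample size $n$ are generic placeholders rather than the paper's exact notation): if the estimated Q-value satisfies a central limit theorem,
$$\sqrt{n}\big(\hat{Q}_n(s,a) - Q^*(s,a)\big) \Rightarrow \mathcal{N}\big(0, \sigma^2(s,a)\big),$$
then an asymptotically valid $(1-\alpha)$ confidence interval is
$$\hat{Q}_n(s,a) \;\pm\; z_{1-\alpha/2}\,\frac{\hat{\sigma}(s,a)}{\sqrt{n}},$$
where $\hat{\sigma}^2(s,a)$ is a consistent estimate of the closed-form asymptotic variance and $z_{1-\alpha/2}$ is the standard normal quantile.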