We study Q-learning with Polyak-Ruppert averaging for a discounted Markov decision process in the synchronous, tabular setting. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iterate $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem yields a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is a regular asymptotically linear (RAL) estimator of the optimal Q-value function $\boldsymbol{Q}^*$ whose influence function is the most efficient one. We also present a nonasymptotic analysis of the $\ell_{\infty}$ error $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results hold for entropy-regularized Q-learning without the Lipschitz condition.
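For context, a minimal sketch of the recursion in question, written in standard notation not spelled out in the abstract: at each iteration, synchronous Q-learning draws one fresh sample per state-action pair and updates
$$\boldsymbol{Q}_t=\boldsymbol{Q}_{t-1}+\eta_t\bigl(\widehat{\mathcal{T}}_t(\boldsymbol{Q}_{t-1})-\boldsymbol{Q}_{t-1}\bigr),\qquad \bar{\boldsymbol{Q}}_T=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{Q}_t,$$
where $\widehat{\mathcal{T}}_t$ is the empirical Bellman optimality operator, $\widehat{\mathcal{T}}_t(\boldsymbol{Q})(s,a)=r(s,a)+\gamma\max_{a'}\boldsymbol{Q}\bigl(s'_t(s,a),a'\bigr)$ with $s'_t(s,a)$ drawn from the transition kernel, $\eta_t$ is a polynomial step size, and $\bar{\boldsymbol{Q}}_T$ is the Polyak-Ruppert average whose fluctuations the functional central limit theorem characterizes.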