We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP. We establish a functional central limit theorem (FCLT) for the averaged iterate $\bar{\boldsymbol{Q}}_T$, showing that its standardized partial-sum process converges weakly to a rescaled Brownian motion. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is in fact a regular asymptotically linear (RAL) estimator of the optimal Q-value function $\boldsymbol{Q}^*$ with the most efficient influence function; hence the averaged Q-learning iterate has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis of the $\ell_{\infty}$ error $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that for polynomial step sizes it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
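For concreteness, here is a minimal sketch of the recursion in question, under the standard synchronous sampling model (the notation $\widehat{\mathcal{T}}_t$, $\eta_t$, and the sample $s'_t(s,a)$ are ours, introduced for illustration): at each step $t$, every state-action pair $(s,a)$ receives a fresh next-state sample $s'_t(s,a)\sim P(\cdot\mid s,a)$, which defines an empirical Bellman optimality operator, and the iterates are then averaged:
$$
\boldsymbol{Q}_t=(1-\eta_t)\,\boldsymbol{Q}_{t-1}+\eta_t\,\widehat{\mathcal{T}}_t(\boldsymbol{Q}_{t-1}),
\qquad
\widehat{\mathcal{T}}_t(\boldsymbol{Q})(s,a)=r(s,a)+\gamma\max_{a'}\boldsymbol{Q}\big(s'_t(s,a),a'\big),
\qquad
\bar{\boldsymbol{Q}}_T=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{Q}_t,
$$
where $\eta_t$ is a step size (e.g., a polynomially decaying choice such as $\eta_t=t^{-\alpha}$), and $\bar{\boldsymbol{Q}}_T$ is the Polyak-Ruppert average analyzed above.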