We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP. We establish a functional central limit theorem (FCLT) for the averaged iterate $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ with the most efficient influence function, which implies that the averaged Q-learning iterate has the smallest asymptotic variance among all RAL estimators. In addition, we present a nonasymptotic analysis of the $\ell_{\infty}$ error $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that for polynomial step sizes it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
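To make the setting concrete, the following is a minimal sketch of synchronous Q-learning with Polyak-Ruppert averaging under a generative model; the function name, the array-based interface for the transition kernel `P` and reward table `R`, and the default polynomial step-size exponent `alpha` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def averaged_q_learning(P, R, gamma, T, alpha=0.7, rng=None):
    """Sketch of synchronous Q-learning with Polyak-Ruppert averaging.

    P : transition kernel, shape (S, A, S); R : reward table, shape (S, A).
    The polynomial step size eta_t = t**(-alpha) is one choice covered by
    the analysis; the exact schedule here is an illustrative assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A = R.shape
    Q = np.zeros((S, A))       # current iterate Q_t
    Q_bar = np.zeros((S, A))   # running Polyak-Ruppert average \bar{Q}_t
    for t in range(1, T + 1):
        eta = t ** (-alpha)    # polynomial step size
        # synchronous update: every (s, a) pair is updated in each round,
        # each using one sampled next state from the generative model
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])
                target = R[s, a] + gamma * Q[s_next].max()
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        Q_bar += (Q - Q_bar) / t   # \bar{Q}_t = (1/t) * sum_{k<=t} Q_k
    return Q_bar
```

The averaged iterate $\bar{\boldsymbol{Q}}_T$, rather than the last iterate $\boldsymbol{Q}_T$, is the estimator whose efficiency the abstract describes.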