We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a $\gamma$-discounted MDP. We establish asymptotic normality for the averaged iterate $\bar{\boldsymbol{Q}}_T$. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is in fact a regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ with the most efficient influence function. This implies that the averaged Q-learning iterate has the smallest asymptotic variance among all RAL estimators. In addition, we present a non-asymptotic analysis of the $\ell_{\infty}$ error $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound as well as the optimal minimax complexity lower bound. As a byproduct, we find that the Bellman noise has sub-Gaussian coordinates with variance $\mathcal{O}((1-\gamma)^{-1})$, rather than the prevailing $\mathcal{O}((1-\gamma)^{-2})$, under the standard bounded-reward assumption. This sub-Gaussian result has the potential to improve the sample complexity of many RL algorithms. In short, our theoretical analysis shows that averaged Q-learning is statistically efficient.
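For concreteness, a minimal sketch of the recursion in question, with step sizes $\eta_t$ and the empirical Bellman operator $\widehat{\mathcal{T}}_t$ as assumed notation (not taken from the paper): synchronous Q-learning updates every state-action pair at each step using one fresh sample per pair, and the Polyak-Ruppert estimate is the running average of the iterates,
$$
\boldsymbol{Q}_t = (1-\eta_t)\,\boldsymbol{Q}_{t-1} + \eta_t\,\widehat{\mathcal{T}}_t(\boldsymbol{Q}_{t-1}),
\qquad
\bar{\boldsymbol{Q}}_T = \frac{1}{T}\sum_{t=1}^{T}\boldsymbol{Q}_t .
$$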