Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

From arXiv. Short version in the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021); full version in Information and Inference: A Journal of the IMA.

Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been made towards characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors), where $T$ is the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves -- by at least a factor of $S^5A^3$ -- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
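
As a rough illustration of the burn-in improvement quoted above, the two sample-size thresholds can be compared directly. This is only a sketch: it assumes the $\mathrm{poly}(H)$ factors in the two thresholds are comparable and suppresses them, which the abstract does not spell out, and the values $S = 100$, $A = 10$ are hypothetical numbers chosen for the arithmetic.

$$
\frac{S^6A^4\,\mathrm{poly}(H)}{SA\,\mathrm{poly}(H)} \;\approx\; S^{5}A^{3},
\qquad\text{e.g. } S = 100,\ A = 10 \ \Rightarrow\ S^{5}A^{3} = 10^{10}\cdot 10^{3} = 10^{13}.
$$

Under these assumptions, the sample size needed before the regret bound of order $\sqrt{H^2SAT}$ takes effect drops by about thirteen orders of magnitude in this example.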
