The theory of reinforcement learning currently suffers from a mismatch between its empirical performance and its theoretical characterization, with consequences for, e.g., our understanding of sample efficiency, safety, and robustness. The linear quadratic regulator with unknown dynamics is a fundamental reinforcement learning setting with significant structure in its dynamics and cost function, yet even in this setting there is a gap between the best known regret lower bound of $\Omega_p(\sqrt{T})$ and the best known upper bound of $O_p(\sqrt{T}\,\text{polylog}(T))$. The contribution of this paper is to close that gap by establishing a novel regret upper bound of $O_p(\sqrt{T})$. Our proof is constructive in that it analyzes the regret of a concrete algorithm, and it simultaneously establishes an estimation error bound on the dynamics of $O_p(T^{-1/4})$, which is also the first to match the rate of a known lower bound. The two keys to our improved proof technique are (1) more precise upper and lower bounds on the system Gram matrix and (2) a self-bounding argument for the expected estimation error of the optimal controller.
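For context, the following is a minimal sketch of the standard online LQR regret formulation under typical assumptions; the symbols $A_*$, $B_*$, $Q$, $R$, and $J_*$ below are illustrative and need not match the notation used in the body of the paper.
% Sketch only: unknown dynamics (A_*, B_*), known costs (Q, R), i.i.d. noise w_t.
\begin{align*}
  x_{t+1} &= A_* x_t + B_* u_t + w_t, && \text{unknown } (A_*, B_*),\ w_t \text{ i.i.d. noise},\\
  c_t &= x_t^\top Q x_t + u_t^\top R u_t, && \text{known } Q \succ 0,\ R \succ 0,\\
  \mathrm{Regret}(T) &= \sum_{t=1}^{T} c_t - T\, J_*, && J_* \text{ the optimal long-run average cost}.
\end{align*}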