Obtaining first-order regret bounds -- regret bounds that scale not with the worst case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K)$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $d$ is the feature dimension, $H$ the horizon, $V_1^\star$ the value of the optimal policy, and $K$ the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
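For context, the following is a brief sketch of the classical (non-self-normalized) Catoni mean estimator on which the robust concentration bound is based; the notation here (samples $X_1,\dots,X_n$, scale parameter $\alpha$, influence function $\psi$) follows Catoni's original formulation and is not taken from this abstract. Given samples $X_1,\dots,X_n$ and a scale $\alpha > 0$, the Catoni estimator $\widehat{\mu}_\alpha$ is defined as the root in $z$ of
$$\sum_{i=1}^n \psi\big(\alpha (X_i - z)\big) = 0, \qquad \text{where } -\log\!\Big(1 - x + \tfrac{x^2}{2}\Big) \le \psi(x) \le \log\!\Big(1 + x + \tfrac{x^2}{2}\Big),$$
and $\psi$ is any nondecreasing influence function satisfying the stated bounds. The logarithmic growth of $\psi$ limits the influence of any single sample, giving sub-Gaussian-style concentration under only a variance assumption; this robustness property is, roughly, what motivates replacing least squares estimation in the setting described above.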