Obtaining first-order regret bounds -- regret bounds that scale not with the worst case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap and show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least-squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
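To make the estimation tool referenced above concrete, the following is a minimal sketch of the classical Catoni mean estimator (not the paper's self-normalized variant): the estimate is the root of a sum of logarithmically growing influence terms, which limits the effect of heavy-tailed outliers relative to the empirical mean. The function names, the bracket widening, and the particular tuning of `alpha` are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import brentq

def catoni_psi(x):
    """Catoni's influence function: grows only logarithmically, limiting outlier impact."""
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(samples, alpha):
    """Catoni mean estimate: the root theta of sum_i psi(alpha * (x_i - theta)) = 0.

    The score is decreasing in theta, strictly positive below min(samples) and
    strictly negative above max(samples), so the root is bracketed and unique.
    """
    xs = np.asarray(samples, dtype=float)
    score = lambda theta: catoni_psi(alpha * (xs - theta)).sum()
    lo, hi = xs.min() - 1.0, xs.max() + 1.0  # widened so the sign change is strict
    return brentq(score, lo, hi)

# Usage: heavy-tailed samples; the Catoni estimate is far less sensitive to the
# two outliers than the empirical mean.
rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(0.0, 1.0, 1000), [50.0, -80.0]])
n, sigma2, delta = len(xs), 1.0, 0.01
# Simplified tuning of the scale parameter; exact constants vary across the literature.
alpha = np.sqrt(2 * np.log(2 / delta) / (n * sigma2))
print("empirical mean:", xs.mean(), " Catoni estimate:", catoni_mean(xs, alpha))
```

The key design choice is the influence function: because it grows logarithmically rather than linearly, a single extreme sample can shift the root only by a bounded amount, which is what gives the estimator its sub-Gaussian-style concentration under only a variance assumption.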