We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both the transitions and the rewards. For finite-horizon tabular MDPs, without prior knowledge of the total amount of corruption, our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$, where $T$ is the number of episodes, $C$ is the total amount of corruption, and $\Delta$ is the reward gap between the best and the second-best policy. This is the first worst-case optimal bound achieved without knowledge of $C$, improving previous results of Lykouris et al. (2021); Chen et al. (2021); Wu et al. (2021). For finite-horizon linear MDPs, we develop a computationally efficient algorithm with a regret bound of $\widetilde{\mathcal{O}}(\sqrt{(1+C)T})$, and another computationally inefficient one with $\widetilde{\mathcal{O}}(\sqrt{T}+C)$, improving the result of Lykouris et al. (2021) and answering an open question by Zhang et al. (2021b). Finally, our model selection framework can be easily applied to other settings including linear bandits, linear contextual bandits, and MDPs with general function approximation, leading to several improved or new results.