We initiate the study of multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of stochastic bandits. We provide a framework which modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on "optimism in the face of uncertainty", by complementing them with principles from "action elimination". Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms which (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees which degrade gracefully with the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) as well as linear-function-approximation settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee which accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.