Reinforcement learning (RL) has surpassed human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models remains less common, as RL models can be highly sensitive to slight perturbations of the environment. The robust Markov decision process (MDP) framework -- in which the transition probabilities belong to an uncertainty set around a nominal model -- provides one way to develop robust models. While previous analyses show that RL algorithms are effective when given access to a generative model, it remains unclear whether RL can be efficient in the more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDPs, in which the learner interacts with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret bound for online robust MDPs.
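For concreteness, here is a minimal sketch of the robust Bellman recursion underlying this framework, under the standard assumption of an $(s,a)$-rectangular uncertainty set $\mathcal{P}(s,a)$ around a nominal kernel $P^0$; the KL-ball of radius $\rho$ and the dual variable $\lambda$ below are illustrative choices, not taken from this paper:
\[
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\Big[\, r(s,a) \;+\; \gamma \inf_{P \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim P}\big[ V^{\pi}(s') \big] \,\Big].
\]
For a KL uncertainty set $\mathcal{P}(s,a) = \{ P : D_{\mathrm{KL}}(P \,\|\, P^0(\cdot \mid s,a)) \le \rho \}$, Fenchel (Lagrangian) duality converts the inner minimization over distributions into a one-dimensional maximization evaluated under the nominal model,
\[
\inf_{P \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim P}\big[ V(s') \big]
\;=\; \sup_{\lambda \ge 0} \Big\{ -\lambda \log \mathbb{E}_{s' \sim P^0(\cdot \mid s,a)}\big[ e^{-V(s')/\lambda} \big] \;-\; \lambda \rho \Big\},
\]
which illustrates the kind of conjugate reformulation on which an optimistic update rule can be built.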