We study the problem of robust reinforcement learning under adversarial corruption of both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary that can arbitrarily corrupt the reward and transition at every step within an episode, for at most an $\epsilon$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior work. Our first result shows that no algorithm can find a better than $O(\epsilon)$-optimal policy under our attack model. Next, we show that, surprisingly, the natural policy gradient (NPG) method retains a natural robustness property when the reward corruption is bounded, and can find an $O(\sqrt{\epsilon})$-optimal policy. Building on this, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the first algorithm that achieves a meaningful learning guarantee when a constant fraction of episodes is corrupted. Complementary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.