In real-world tasks, reinforcement learning (RL) agents frequently encounter situations that were not present during training. To perform reliably, RL agents need to be robust against worst-case situations. The robust RL framework addresses this challenge via a worst-case optimization between an agent and an adversary. Previous robust RL algorithms are either sample inefficient, lack robustness guarantees, or do not scale to large problems. We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem while attaining near-optimal sample complexity guarantees. RH-UCRL is a model-based reinforcement learning (MBRL) algorithm that effectively distinguishes between epistemic and aleatoric uncertainty and efficiently explores both the agent and adversary decision spaces during policy learning. We scale RH-UCRL to complex tasks via neural network ensemble models as well as neural network policies. Experimentally, we demonstrate that RH-UCRL outperforms other robust deep RL algorithms in a variety of adversarial environments.
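The abstract does not spell out how the ensemble models separate epistemic from aleatoric uncertainty. As a minimal sketch, assuming a commonly used decomposition for probabilistic ensembles (not necessarily the one RH-UCRL uses), each ensemble member predicts a mean and a variance for the next state. Disagreement between the members' means is then read as epistemic uncertainty, and the average predicted variance as aleatoric uncertainty. The function name `decompose_uncertainty` and the numbers below are hypothetical and purely illustrative.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split ensemble predictions into epistemic and aleatoric parts.

    Assumption (not taken from the abstract): every ensemble member outputs
    a Gaussian prediction, given as a mean and a variance per state dimension.
    Both inputs have shape (n_models, state_dim).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Epistemic: how much the members disagree about the predicted mean.
    epistemic = means.var(axis=0)
    # Aleatoric: the noise level the members predict on average.
    aleatoric = variances.mean(axis=0)
    return epistemic, aleatoric


# Hypothetical predictions from a two-member ensemble over a 2-D state.
means = [[1.0, 0.0], [3.0, 0.0]]
variances = [[0.5, 0.2], [0.5, 0.4]]
epistemic, aleatoric = decompose_uncertainty(means, variances)
```

In this toy input the members disagree only on the first state dimension, so only that dimension carries epistemic uncertainty. Under this assumed decomposition, that is the quantity an exploration strategy would aim to reduce by collecting more data, while the aleatoric part stays even with unlimited data.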