Reinforcement learning (RL) agents need to be robust to variations in safety-critical environments. While system identification methods provide a way to infer the variation from online experience, they can fail in settings where fast identification is not possible. Another dominant approach is robust RL, which produces a policy that can handle worst-case scenarios, but these methods are generally designed to achieve robustness to a single uncertainty set that must be specified at training time. Towards a more general solution, we formulate the multi-set robustness problem to learn a policy robust to different perturbation sets. We then design an algorithm that enjoys the benefits of both system identification and robust RL: it reduces uncertainty where possible given a few interactions, but can still act robustly with respect to the remaining uncertainty. On a diverse set of control tasks, our approach demonstrates improved worst-case performance on new environments compared to prior methods based on system identification or robust RL alone.
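As a rough illustration (not the paper's exact formulation), the multi-set robust objective can be sketched as follows, where $\{\mathcal{U}_i\}_{i=1}^{K}$ denotes a family of candidate uncertainty sets over transition dynamics and $\pi(\cdot \mid s, \mathcal{U}_i)$ is a set-conditioned policy; these symbols are illustrative assumptions rather than notation taken from the paper:

\[
\max_{\pi} \; \sum_{i=1}^{K} \; \min_{P \in \mathcal{U}_i} \; \mathbb{E}_{P,\, \pi(\cdot \mid \cdot,\, \mathcal{U}_i)} \Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Big],
\]

i.e., a single policy is trained to attain good worst-case return under each candidate perturbation set. At test time, a few online interactions can shrink the candidate set, after which the policy acts robustly against whatever uncertainty remains.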