A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks, or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer poor results on high-risk or difficult tasks. This limits system reliability whenever test tasks are not known in advance. In this work, we propose a robust MRL objective with a controlled robustness level. Optimizing analogous robust objectives in RL often leads to both biased gradients and data inefficiency. We prove that the former disappears in MRL, and address the latter via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML learns substantially different meta-policies and achieves robust returns on several navigation and continuous control benchmarks.
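As a concrete reading of "a robust MRL objective with a controlled robustness level", one natural formalization (assumed here for illustration; the abstract itself does not spell out the objective) replaces the average return over tasks with the Conditional Value at Risk (CVaR) of the return over the task distribution, where the level \(\alpha\) controls the degree of robustness:

% Sketch only: tau ~ D is a task, R(pi_theta; tau) its meta-test return,
% and alpha in (0,1] controls the robustness level (alpha = 1 recovers the mean).
\begin{equation*}
  \max_{\theta}\;\mathrm{CVaR}_{\alpha}\!\big[R(\pi_\theta;\tau)\big]
  \;=\;
  \max_{\theta}\;\mathbb{E}_{\tau\sim\mathcal{D}}
  \Big[R(\pi_\theta;\tau)\;\Big|\;R(\pi_\theta;\tau)\le q_{\alpha}(\theta)\Big],
\end{equation*}

where \(q_{\alpha}(\theta)\) denotes the \(\alpha\)-quantile of the return distribution over tasks, so the objective averages only over the hardest \(\alpha\)-fraction of tasks.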
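The abstract describes RoML as a meta-algorithm that wraps any base MRL method and over-samples harder tasks during meta-training. The sketch below illustrates one simple way such a task sampler could look: it tracks an exponential moving average of each task's observed return and proposes tasks with probability decreasing in that estimate. The class and parameter names (HardTaskSampler, temperature, ema_beta) are hypothetical, and the scheme is a minimal stand-in for RoML's actual sampling mechanism, not the paper's implementation.

import numpy as np

class HardTaskSampler:
    """Over-sample tasks with low estimated return (hypothetical sketch).

    Tracks an exponential moving average (EMA) of the return observed for
    each task and draws the next training task with probability
    proportional to softmax(-EMA / temperature), so harder (lower-return)
    tasks are proposed more often.
    """

    def __init__(self, num_tasks, temperature=1.0, ema_beta=0.9, seed=0):
        self.returns = np.zeros(num_tasks)           # EMA of per-task returns
        self.counts = np.zeros(num_tasks, dtype=int)  # observations per task
        self.temperature = temperature
        self.ema_beta = ema_beta
        self.rng = np.random.default_rng(seed)

    def sample_task(self):
        # Lower estimated return -> higher sampling probability.
        logits = -self.returns / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return self.rng.choice(len(self.returns), p=probs)

    def update(self, task_id, episode_return):
        # First observation initializes the EMA; later ones smooth it.
        if self.counts[task_id] == 0:
            self.returns[task_id] = episode_return
        else:
            self.returns[task_id] = (self.ema_beta * self.returns[task_id]
                                     + (1.0 - self.ema_beta) * episode_return)
        self.counts[task_id] += 1


# Usage with a generic meta-training loop (meta_policy and run_episode are
# placeholders for whatever base MRL algorithm is being made robust):
# sampler = HardTaskSampler(num_tasks=40)
# for step in range(num_meta_iterations):
#     task = sampler.sample_task()
#     ret = run_episode(meta_policy, task)   # adapt to and evaluate on the task
#     sampler.update(task, ret)
#     meta_policy.meta_update(task, ret)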