Meta-reinforcement learning algorithms provide a data-driven way to acquire policies that quickly adapt to many tasks with varying rewards or dynamics functions. However, learned meta-policies are often effective only on the exact task distribution on which they were trained, and struggle in the presence of distribution shift in test-time rewards or transition dynamics. In this work, we develop a framework for meta-RL algorithms that behave appropriately under test-time distribution shifts in the space of tasks. Our framework centers on an adaptive approach to distributional robustness that trains a population of meta-policies to be robust to varying levels of distribution shift. When evaluated on a potentially shifted test-time task distribution, this population allows us to choose the meta-policy with the most appropriate level of robustness and use it to perform fast adaptation. We formally show how our framework allows for improved regret under distribution shift, and empirically show its efficacy on simulated robotics problems under a wide range of distribution shifts.
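To make the test-time selection idea concrete, the following is a minimal, self-contained sketch of the framework's high-level structure: a population of meta-policies, each assumed to have been meta-trained with distributional robustness at a different shift radius, is evaluated on the (possibly shifted) test tasks, and the member with the best post-adaptation return is selected. All names (`ROBUSTNESS_LEVELS`, `adapt_and_evaluate`, `select_meta_policy`), the toy return model, and the stubbed training/adaptation steps are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed candidate robustness levels: each meta-policy in the population is
# taken to have been meta-trained with distributional robustness at radius
# `eps` (larger eps = robust to larger task-distribution shifts, typically at
# some cost on the nominal training distribution). Meta-training is stubbed.
ROBUSTNESS_LEVELS = [0.0, 0.1, 0.3, 1.0]


def adapt_and_evaluate(meta_policy, task):
    """Run the meta-policy's fast-adaptation procedure on one test task and
    return its post-adaptation return. Stubbed here with a toy score: robust
    policies pay a small nominal cost but degrade less as the test task moves
    away from the training distribution."""
    eps = meta_policy["eps"]
    shift = task["shift"]
    return -abs(shift - eps) - 0.1 * eps + 0.05 * rng.standard_normal()


def select_meta_policy(population, test_tasks):
    """Test-time selection: pick the population member whose average
    post-adaptation return on the (possibly shifted) test tasks is highest."""
    scores = []
    for meta_policy in population:
        returns = [adapt_and_evaluate(meta_policy, t) for t in test_tasks]
        scores.append(float(np.mean(returns)))
    best = int(np.argmax(scores))
    return population[best], scores


if __name__ == "__main__":
    # Population of meta-policies (meta-training itself is not modeled here).
    population = [{"eps": eps} for eps in ROBUSTNESS_LEVELS]
    # Hypothetical shifted test-time task distribution.
    test_tasks = [{"shift": 0.4 + 0.05 * rng.standard_normal()} for _ in range(20)]

    chosen, scores = select_meta_policy(population, test_tasks)
    print("mean return per robustness level:",
          dict(zip(ROBUSTNESS_LEVELS, np.round(scores, 3))))
    print("selected robustness level:", chosen["eps"])
```

In this toy setup, the selection step naturally picks the robustness level closest to the realized test-time shift, which is the intuition behind choosing "the meta-policy with the most appropriate level of robustness" described above.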