Meta-reinforcement learning (RL) can be used to train policies that quickly adapt to new tasks with orders of magnitude less data than standard RL, but this fast adaptation often comes at the cost of greatly increasing the amount of reward supervision needed during meta-training. Offline meta-RL removes the need to continuously provide reward supervision, because rewards need only be provided once, when the offline dataset is generated. In addition to the challenges of offline RL, a unique distribution shift is present in meta-RL: agents learn exploration strategies that can gather the experience needed to learn a new task, and also learn adaptation strategies that work well when presented with the trajectories in the dataset, but the adaptation strategies are not adapted to the data distribution that the learned exploration strategies collect. Unlike in the online setting, the adaptation and exploration strategies cannot effectively adapt to each other, resulting in poor performance. In this paper, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any ground-truth reward labels, to bridge this distribution shift. Our method uses the offline data to learn the distribution of reward functions, which is then sampled to self-supervise reward labels for the additional online data. By removing the need to provide reward labels for the online experience, our approach can be more practical in settings where reward supervision would otherwise be provided manually. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional data and self-generated rewards significantly improves an agent's ability to generalize.
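To make the self-supervision step concrete, the following is a minimal sketch (not the paper's implementation) of the core idea under simple assumptions: a reward decoder conditioned on a latent task variable is fit by regression on the reward-labeled offline data, and is then used to label reward-free transitions collected online by sampling a latent from the learned task distribution. All names here (RewardDecoder, train_decoder_on_offline_data, label_online_batch) and the PyTorch-based structure are illustrative assumptions, not the authors' API.

import torch
import torch.nn as nn

class RewardDecoder(nn.Module):
    """Predicts reward from (state, action) and a latent task variable z."""
    def __init__(self, obs_dim, act_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, z):
        return self.net(torch.cat([obs, act, z], dim=-1)).squeeze(-1)

def train_decoder_on_offline_data(decoder, encoder, offline_batches, lr=3e-4):
    """Fit the decoder by supervised regression on reward-labeled offline data.
    `encoder` is assumed to map a context of transitions to a latent task
    distribution q(z | context) (a torch.distributions object)."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for obs, act, rew, context in offline_batches:
        z = encoder(context).rsample()          # infer task latent from context
        loss = ((decoder(obs, act, z) - rew) ** 2).mean()  # regress onto true rewards
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder

@torch.no_grad()
def label_online_batch(decoder, latent_prior, obs, act):
    """Self-supervise rewards for reward-free online data: sample a task latent
    from the learned latent distribution and decode a reward for each step."""
    z = latent_prior.sample().expand(obs.shape[0], -1)
    return decoder(obs, act, z)                 # pseudo-rewards for meta-training

In this sketch, the pseudo-rewards returned by label_online_batch would stand in for ground-truth reward labels when the additional online experience is added to meta-training, which is the role the self-generated rewards play in the method described above.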