Meta-reinforcement learning (RL) can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adaptation, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We do not want to remove this distributional shift by simply adopting a conservative exploration strategy, because learning an exploration strategy enables an agent to collect better data for faster adaptation. Instead, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distributional shift. Because it requires no reward labels, this online data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.
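The two-phase procedure described above can be summarized schematically. The following is a minimal sketch, not the paper's implementation: all names (`OfflineDataset`, `AdaptivePolicy`, `meta_update`, `hybrid_offline_meta_train`) and the way unlabeled online data is folded into the meta-objective are illustrative assumptions, shown only to make the offline-then-unsupervised-online structure concrete.

```python
import random

# Hypothetical placeholder classes; a real system would wrap environments,
# replay buffers, and a context-conditioned actor-critic.

class OfflineDataset:
    """Static, reward-labeled trajectories collected once per training task."""
    def __init__(self, trajectories_by_task):
        self.trajectories_by_task = trajectories_by_task  # {task_id: [traj, ...]}

    def sample(self, task_id, n=4):
        trajs = self.trajectories_by_task[task_id]
        return random.sample(trajs, min(n, len(trajs)))

class AdaptivePolicy:
    """Placeholder for a meta-learned, context-conditioned policy."""
    def adapt(self, context):
        # Infer a task belief from the provided trajectories.
        return {"context_size": len(context)}

    def explore(self, task_id, belief):
        # Roll out the learned exploration strategy; note: no reward labels.
        return {"task": task_id, "reward": None}

    def meta_update(self, batch, use_reward_labels):
        pass  # One gradient step on the meta-objective (details omitted).

def hybrid_offline_meta_train(policy, offline_data, train_tasks,
                              offline_steps=1000, online_steps=1000):
    # Phase 1: ordinary offline meta-training on the reward-labeled dataset.
    for _ in range(offline_steps):
        task = random.choice(train_tasks)
        policy.meta_update(offline_data.sample(task), use_reward_labels=True)

    # Phase 2: collect additional online data with the learned exploration
    # strategy, without reward labels, so meta-training sees the same kind of
    # data the policy will condition on at meta-test time (bridging the shift).
    for _ in range(online_steps):
        task = random.choice(train_tasks)
        belief = policy.adapt(offline_data.sample(task))
        unlabeled = [policy.explore(task, belief)]
        policy.meta_update(offline_data.sample(task) + unlabeled,
                           use_reward_labels=False)
    return policy
```

The key design point the sketch tries to convey is that phase 2 never queries a reward function: the self-collected rollouts only serve to expose the adaptation procedure to on-policy exploration data, reducing the mismatch between meta-training and meta-test conditions.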