Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks, starting from easy ones and gradually progressing toward difficult ones. In this work, we focus on the idea of framing CRL as an interpolation between a source (auxiliary) and a target task distribution. Although existing studies have shown the great potential of this idea, it remains unclear how to formally quantify and generate the movement between task distributions. Inspired by insights from gradual domain adaptation in semi-supervised learning, we create a natural curriculum by breaking down the potentially large task distributional shift in CRL into smaller shifts. We propose GRADIENT, which formulates CRL as an optimal transport problem with a tailored distance metric between tasks. Specifically, we generate a sequence of task distributions as a geodesic interpolation (i.e., Wasserstein barycenter) between the source and target distributions. Unlike many existing methods, our algorithm considers a task-dependent contextual distance metric and is capable of handling nonparametric distributions in both continuous and discrete context settings. In addition, we theoretically show that GRADIENT enables smooth transfer between subsequent stages in the curriculum under certain conditions. We conduct extensive experiments on locomotion and manipulation tasks and show that GRADIENT outperforms baselines in both learning efficiency and asymptotic performance.
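To make the geodesic interpolation concrete, the following is a minimal sketch of generating intermediate task distributions between a source and a target distribution of task contexts. It uses the POT (Python Optimal Transport) library and a squared-Euclidean ground cost as a stand-in for the paper's task-dependent contextual metric; all variable names (`contexts_src`, `contexts_tgt`, `mu_src`, `mu_tgt`) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: displacement (geodesic) interpolation between two discrete task
# distributions, i.e., the two-distribution Wasserstein barycenter with
# weights (1 - t, t). Assumes the POT library: https://pythonot.github.io
import numpy as np
import ot

rng = np.random.default_rng(0)
contexts_src = rng.normal(loc=0.0, scale=0.5, size=(50, 2))  # easy (source) task contexts
contexts_tgt = rng.normal(loc=3.0, scale=0.5, size=(60, 2))  # hard (target) task contexts
mu_src = np.full(50, 1 / 50)  # uniform weights on source contexts
mu_tgt = np.full(60, 1 / 60)  # uniform weights on target contexts

# Ground cost between task contexts. The paper uses a tailored task-dependent
# metric; squared Euclidean distance is only a placeholder for this sketch.
C = ot.dist(contexts_src, contexts_tgt, metric="sqeuclidean")
plan = ot.emd(mu_src, mu_tgt, C)  # optimal transport plan between the two distributions

def interpolate(t):
    """Task distribution at stage t on the Wasserstein geodesic (t=0: source, t=1: target).

    Each unit of transported mass plan[i, j] is placed at the displacement-
    interpolated point (1 - t) * x_i + t * y_j.
    """
    i, j = np.nonzero(plan)
    points = (1 - t) * contexts_src[i] + t * contexts_tgt[j]
    weights = plan[i, j]
    return points, weights / weights.sum()

# A curriculum of K + 1 stages: the large source-to-target shift is broken
# into K small shifts between consecutive interpolated distributions.
K = 5
curriculum = [interpolate(k / K) for k in range(K + 1)]
```

In this sketch, each stage's distribution is sampled to generate training tasks, so consecutive stages differ by a small, controlled Wasserstein distance.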