Reinforcement Learning (RL) algorithms are known for sample inefficiency and poor generalization. Unsupervised Environment Design (UED) recently emerged as a paradigm for zero-shot generalization that simultaneously learns a task distribution and trains agent policies on tasks sampled from it. This process is non-stationary: the task distribution evolves along with the agent policies, which creates instability over time. While past work has demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge and a bottleneck for them. To this end, we introduce CLUTR, a novel curriculum learning algorithm that decouples task representation learning and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax regret-based objective on a set of latent tasks sampled from this manifold. By keeping the task manifold fixed, we show that CLUTR overcomes the non-stationarity problem and improves stability. Our experimental results show that CLUTR outperforms PAIRED, a principled and popular UED method, in generalization and sample efficiency in the challenging CarRacing and navigation environments, achieving an 18x improvement on the F1 CarRacing benchmark. CLUTR also performs comparably to the non-UED state of the art for CarRacing, outperforming it on nine of the 20 tracks, and achieves a 33% higher solved rate than PAIRED on a set of 18 out-of-distribution navigation tasks.
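The second stage described above can be illustrated with a minimal sketch. It assumes the regret estimate commonly associated with PAIRED (best antagonist return minus mean protagonist return) and a standard normal prior over the latent task space; neither detail is stated in this abstract. The function names `regret_estimate`, `sample_latent_tasks`, and `highest_regret_task` are hypothetical, and the greedy selection is only a stand-in for a trained teacher agent, not the authors' method.

```python
import random
from statistics import mean


def regret_estimate(antagonist_returns, protagonist_returns):
    """PAIRED-style regret estimate (an assumption, not taken from the abstract):
    best antagonist episode return minus mean protagonist episode return."""
    return max(antagonist_returns) - mean(protagonist_returns)


def sample_latent_tasks(num_tasks, latent_dim, seed=0):
    """Draw latent task vectors from a standard normal prior.
    In the described setup, a frozen, pretrained decoder would map each
    vector to a concrete task; that decoder is omitted here."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(num_tasks)]


def highest_regret_task(latent_tasks, regrets):
    """Greedy stand-in for the teacher: return the latent task whose
    measured regret is largest."""
    best_index = max(range(len(latent_tasks)), key=lambda i: regrets[i])
    return latent_tasks[best_index]


# Toy usage: regrets would normally come from rolling out both agents
# on the decoded tasks; here they are hand-written placeholders.
tasks = sample_latent_tasks(num_tasks=3, latent_dim=4)
chosen = highest_regret_task(tasks, regrets=[0.1, 0.9, 0.3])
```

Because the latent space is learned once and then frozen, only the teacher's choice of latent vectors changes during curriculum learning, which is the decoupling the abstract credits for improved stability.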