A wide range of reinforcement learning (RL) problems, including robustness, transfer learning, unsupervised RL, and emergent complexity, require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error-prone and takes significant developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, in which developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments that maximize regret, defined as the difference between the returns of the antagonist and protagonist agents. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and that PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
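To make the regret objective concrete, below is a minimal, hypothetical Python sketch, not the authors' implementation. The function name, the best-antagonist-minus-average-protagonist estimator, and the numeric returns are all illustrative assumptions; the key point is that an unsolvable environment, where both agents score near zero, also yields near-zero regret, so the adversary is pushed toward environments that are hard but solvable.

```python
from statistics import mean

def paired_regret(antagonist_returns, protagonist_returns):
    """Illustrative regret estimate for one proposed environment.

    One reasonable estimator (an assumption here, not necessarily the paper's
    exact choice) scores the environment by the best antagonist rollout minus
    the average protagonist rollout. Unsolvable environments give ~0 regret
    because neither agent can obtain return there.
    """
    return max(antagonist_returns) - mean(protagonist_returns)

if __name__ == "__main__":
    # Made-up returns from a few rollouts in one proposed environment:
    # the allied antagonist can solve it, the protagonist still struggles.
    antagonist_returns = [1.0, 0.75, 0.5]
    protagonist_returns = [0.25, 0.25, 0.25]

    regret = paired_regret(antagonist_returns, protagonist_returns)
    print(regret)  # 0.75

    # In training, the adversary and antagonist would be rewarded with
    # +regret and the protagonist with -regret, so the protagonist is
    # driven to close the gap on each newly generated environment.
```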