Agent decision making using Reinforcement Learning (RL) relies heavily on a model or simulator of the environment (e.g., moving in an 8x8 maze with three rooms, playing chess on an 8x8 board). Due to this dependence, small changes in the environment (e.g., the positions of obstacles in the maze, or the size of the board) can severely affect the effectiveness of the policy learned by the agent. To mitigate this, existing work has proposed training RL agents on an adaptive curriculum of automatically generated environments to improve performance on out-of-distribution (OOD) test scenarios. Specifically, existing research has employed the agent's potential to learn in an environment (captured using Generalized Advantage Estimation, GAE) as the key factor in selecting the next environment(s) on which to train the agent. However, such a mechanism can select several similar environments (each with high potential to learn), making agent training redundant on all but one of them. To address this, we provide a principled approach to adaptively identify diverse environments based on a novel distance measure tailored to environment design. We empirically demonstrate the versatility and effectiveness of our method, in comparison to multiple leading approaches for unsupervised environment design, on three distinct benchmark problems from the literature.
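To make the two ingredients named above concrete, the following is a minimal Python sketch, not the paper's actual algorithm: gae_score computes standard Generalized Advantage Estimation over one rollout and uses the mean absolute advantage as a stand-in proxy for "learning potential", and select_diverse_envs is a hypothetical greedy farthest-point heuristic that trades off that score against a user-supplied distance function, which here merely stands in for the paper's novel environment distance measure.

import numpy as np

def gae_score(rewards, values, gamma=0.99, lam=0.95):
    # Standard GAE over one rollout; `values` has one extra entry
    # (the bootstrap value of the final state). The mean absolute
    # advantage is used here only as a proxy for learning potential.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return np.abs(advantages).mean()

def select_diverse_envs(candidates, scores, distance, k):
    # Greedy farthest-point selection among high-potential candidates.
    # `distance` is a placeholder for an environment distance measure
    # (any function mapping two environment encodings to a float).
    chosen = [int(np.argmax(scores))]
    while len(chosen) < k:
        best, best_gap = None, -1.0
        for i in range(len(candidates)):
            if i in chosen:
                continue
            # Distance to the closest already-chosen environment,
            # weighted by the candidate's learning potential.
            gap = scores[i] * min(distance(candidates[i], candidates[j]) for j in chosen)
            if gap > best_gap:
                best, best_gap = i, gap
        chosen.append(best)
    return [candidates[i] for i in chosen]

Under these assumptions, the sketch illustrates the intended behavior: environments with high learning potential are preferred, but a candidate that is very close (under the distance measure) to an already-selected environment is penalized, so training is not spent on near-duplicates.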