Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the training and test environments. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
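The core issue can be seen in a toy setting. The sketch below is purely illustrative (the distributions, payoffs, and function names are hypothetical, and it is not the SAMPLR algorithm): it models an aleatoric parameter theta as the payoff probability of a risky action, and shows that a curriculum which over-samples hard settings shifts the distribution of theta, so the policy that is optimal under the curriculum's distribution differs from the one that is optimal under the ground-truth deployment distribution.

```python
# Toy illustration of curriculum-induced covariate shift (CICS) on an
# aleatoric parameter. All numbers and names are hypothetical; this is a
# sketch of the phenomenon described in the abstract, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

def ground_truth_theta(n):
    # Deployment distribution over the aleatoric parameter: the risky
    # action pays off about 70% of the time on average.
    return rng.beta(7, 3, size=n)

def curriculum_theta(n):
    # A hypothetical curriculum that over-samples hard settings,
    # modeled here as low-payoff coins (mean payoff probability ~30%).
    return rng.beta(3, 7, size=n)

def expected_return(theta_samples, take_risky_action):
    # Risky action returns +1 with probability theta, -1 otherwise;
    # the safe action always returns 0.
    if not take_risky_action:
        return 0.0
    return float(np.mean(2 * theta_samples - 1))

gt = ground_truth_theta(100_000)
cur = curriculum_theta(100_000)

# Under the ground-truth distribution the risky action is optimal (~+0.4),
# but under the curriculum's shifted distribution it looks harmful (~-0.4),
# so a policy trained on the curriculum alone learns the safe action and is
# suboptimal at deployment.
print("E[return | risky, ground truth]:", expected_return(gt, True))
print("E[return | risky, curriculum ]:", expected_return(cur, True))
```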