Achieving robust performance is crucial when applying deep reinforcement learning (RL) in safety-critical systems. Some state-of-the-art approaches address the problem with adversarial agents, but these agents often require expert supervision to fine-tune and to prevent the adversary from becoming too challenging for the trainee agent. Other approaches automatically adjust environment setups during training, but they have been limited to simple environments where low-dimensional encodings can be used. Inspired by these approaches, we propose genetic curriculum, an algorithm that automatically identifies scenarios in which the agent currently fails and generates an associated curriculum that helps the agent learn to solve those scenarios and acquire more robust behaviors. As a non-parametric optimizer, our approach uses a raw, non-fixed encoding of scenarios, reducing the need for expert supervision and allowing our algorithm to adapt to the agent's changing performance. Our empirical studies show improved robustness over existing state-of-the-art algorithms, providing training curricula that make agents 2-8x less likely to fail without sacrificing cumulative reward. We include an ablation study and share insights on why our algorithm outperforms prior approaches.
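To make the high-level idea concrete, the sketch below shows one possible genetic-curriculum loop, assuming scenarios are represented as raw, variable-length lists of floats and that `evaluate`, `mutate`, `crossover`, and `train` are hypothetical stand-ins for the paper's actual failure test, genetic operators, and RL update; it is a minimal illustration, not the authors' implementation.

```python
import random

def evaluate(agent, scenario):
    """Return True if the agent fails the scenario (placeholder failure test)."""
    return random.random() < 0.5  # stub for illustration only

def mutate(scenario):
    """Perturb one element of the raw scenario encoding."""
    child = list(scenario)
    i = random.randrange(len(child))
    child[i] += random.gauss(0.0, 0.1)
    return child

def crossover(a, b):
    """Single-point crossover over two raw scenario encodings."""
    if min(len(a), len(b)) < 2:
        return list(a)
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def genetic_curriculum(agent, scenarios, generations=10, train=lambda agent, s: None):
    """Evolve a population of scenarios toward ones the agent currently fails,
    training on those failures so the curriculum tracks the agent's weaknesses."""
    population = list(scenarios)
    for _ in range(generations):
        failures = [s for s in population if evaluate(agent, s)]
        if not failures:
            break  # agent is robust on the current population
        # Breed new candidate scenarios from the failure cases.
        offspring = [mutate(random.choice(failures))
                     for _ in range(len(population) // 2)]
        offspring += [crossover(*random.sample(failures, 2))
                      for _ in range(len(population) - len(offspring))
                      if len(failures) >= 2]
        # Train on the failure scenarios before the next generation.
        for s in failures:
            train(agent, s)
        population = failures + offspring
    return population
```

Because the population is updated non-parametrically from the agent's current failures, no fixed scenario encoding or hand-tuned adversary schedule is required, which is the property the abstract emphasizes.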