Deep Reinforcement Learning achieves very good results in domains where reward functions can be manually engineered. At the same time, there is growing interest within the community in using games based on Procedural Content Generation (PCG) as benchmark environments, since such environments are well suited to studying overfitting and the generalization of agents under domain shift. Inverse Reinforcement Learning (IRL) can instead extrapolate reward functions from expert demonstrations, with good results even on high-dimensional problems; however, there are no examples of applying these techniques to procedurally-generated environments. This is mostly due to the number of demonstrations needed to find a good reward model. We propose a technique based on Adversarial Inverse Reinforcement Learning which can significantly decrease the need for expert demonstrations in PCG games. By restricting the environment to a limited set of initial seed levels, together with some modifications to stabilize training, we show that our approach, DE-AIRL, is demonstration-efficient and still able to extrapolate reward functions that generalize to the fully procedural domain. We demonstrate the effectiveness of our technique on two procedural environments, MiniGrid and DeepCrawl, for a variety of tasks.
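To make the two-phase scheme described above concrete, the following is a minimal structural sketch, under stated assumptions: the reward is learned adversarially on a small, fixed set of seed levels, and a fresh policy is then trained with that learned reward on the fully procedural environment. All names here (make_pcg_env, collect_expert_demos, AIRLTrainer, train_rl) and the number of seed levels are illustrative placeholders, not the authors' implementation.

```python
"""Hypothetical sketch of demonstration-efficient AIRL on a PCG environment.
All helpers below are no-op stubs standing in for real components."""

import random


def make_pcg_env(seed=None):
    """Placeholder PCG environment: seed=None generates a new level per
    episode, while a fixed seed always reproduces the same level."""
    return {"seed": seed}


def collect_expert_demos(env, num_episodes=5):
    """Placeholder for expert trajectories recorded on one seed level."""
    return [f"demo(seed={env['seed']}, ep={i})" for i in range(num_episodes)]


class AIRLTrainer:
    """Placeholder adversarial IRL trainer (discriminator + policy)."""

    def __init__(self, demos):
        self.demos = demos
        self.reward_model = lambda transition: 0.0  # learned reward stub

    def update(self, env):
        pass  # one discriminator/policy update on rollouts from `env`


def train_rl(env, reward_fn, steps):
    """Placeholder RL loop (e.g. PPO) driven only by the learned reward."""
    return "policy"


# Phase 1: learn the reward on a small, fixed set of seed levels,
# which keeps the number of required expert demonstrations low.
NUM_SEED_LEVELS = 10  # assumed size of the restricted level set
seed_envs = [make_pcg_env(seed=s) for s in range(NUM_SEED_LEVELS)]
demos = [d for env in seed_envs for d in collect_expert_demos(env)]

trainer = AIRLTrainer(demos)
for _ in range(1000):
    trainer.update(random.choice(seed_envs))

# Phase 2: train a fresh policy on the fully procedural environment,
# using the learned reward only; no further demonstrations are needed.
full_env = make_pcg_env(seed=None)
policy = train_rl(full_env, reward_fn=trainer.reward_model, steps=100_000)
```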