Deep Reinforcement Learning (RL) techniques can benefit greatly from leveraging prior experience, which can be either self-generated or acquired from other entities. Action advising is a framework that provides a flexible way to transfer such knowledge in the form of actions between teacher-student peers. However, due to practical concerns, the number of these interactions is limited by a budget; therefore, it is crucial to perform them at the most appropriate moments. Several promising recent studies address this problem setting, especially from the student's perspective. Despite their success, they have shortcomings in terms of practical applicability and completeness as an overall solution to the challenge of learning from advice. In this paper, we extend the idea of advice reuse via teacher imitation to construct a unified approach that addresses both the advice collection and the advice utilisation problems. We also propose a method to automatically tune the relevant hyperparameters of these components on-the-fly, so that the approach can adapt to any task with minimal human intervention. Experiments in 5 different Atari games verify that our algorithm either surpasses or performs on par with its top competitors while being far simpler to employ. Furthermore, its individual components are also found to provide significant advantages on their own.
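The abstract describes budgeted action advising, where a student queries a teacher for actions under a limited budget and later reuses the collected advice by imitating the teacher. The sketch below is a minimal illustration of that interaction loop only, assuming an uncertainty-triggered query rule and a classic Gym-style environment interface; the class names, the uncertainty threshold, and the budget value are placeholders rather than the paper's actual components, and the automatic hyperparameter tuning mentioned above is omitted.

```python
import random
from collections import deque

# Illustrative constants; the real values are task-dependent and, in the paper,
# tuned automatically rather than fixed by hand.
ADVICE_BUDGET = 25_000          # assumed total number of teacher queries allowed
UNCERTAINTY_THRESHOLD = 0.5     # assumed trigger for asking the teacher

class Student:
    """Stand-in learner exposing only the pieces this sketch needs."""
    def act(self, state):
        return random.randrange(4)      # placeholder policy
    def uncertainty(self, state):
        return random.random()          # placeholder epistemic uncertainty estimate
    def learn(self, transition):
        pass                            # RL update would go here

class Teacher:
    """Stand-in pre-trained policy that provides advice on request."""
    def act(self, state):
        return random.randrange(4)

class DummyEnv:
    """Toy environment with a Gym-style reset/step interface, for illustration only."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 10, {}

def run_episode(env, student, teacher, budget, imitation_buffer):
    """One episode of budgeted advice collection, storing advised pairs for reuse."""
    state, done = env.reset(), False
    while not done:
        if budget > 0 and student.uncertainty(state) > UNCERTAINTY_THRESHOLD:
            action = teacher.act(state)                 # collect advice from the teacher
            imitation_buffer.append((state, action))    # keep it for reuse via imitation
            budget -= 1
        else:
            action = student.act(state)                 # act autonomously
        next_state, reward, done, _ = env.step(action)
        student.learn((state, action, reward, next_state, done))
        state = next_state
    return budget

buffer = deque(maxlen=10_000)
remaining = run_episode(DummyEnv(), Student(), Teacher(), ADVICE_BUDGET, buffer)
```

The stored (state, advised action) pairs would then feed an imitation model of the teacher, so that advice can be reused in later, similar states without spending further budget.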