SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration

Giulia Vezzani, Dhruva Tirumala, Markus Wulfmeier, Dushyant Rao, Abbas Abdolmaleki, Ben Moran, Tuomas Haarnoja, Jan Humplik, Roland Hafner, Michael Neunert, Claudio Fantacci, Tim Hertweck, Thomas Lampe, Fereshteh Sadeghi, Nicolas Heess, Martin Riedmiller

The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains, including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of different components of our method.
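The split described in the abstract, with skills used only to gather data and the final policy learned from raw transitions, can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-D chain environment, the two hand-written skills, the epsilon-greedy bandit standing in for the learned skill scheduler (with one warm-up episode per skill), and tabular Q-learning standing in for the off-policy learner. All names (`collect`, `learn_policy`, `SKILLS`) are hypothetical.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 10, 9      # toy 1-D chain (assumption); reward only at the right end
ACTIONS = (-1, +1)          # primitive actions: step left / step right


def env_step(state, action):
    """Deterministic chain dynamics with a sparse reward at GOAL."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)


# Stand-ins for pre-trained skills: fixed policies from state to primitive action.
SKILLS = {"go_left": lambda s: -1, "go_right": lambda s: +1}


def collect(episodes=50, skill_len=3, horizon=30, eps=0.2, seed=0):
    """Sequence skills for exploration; store raw primitive-level transitions."""
    rng = random.Random(seed)
    names = list(SKILLS)
    value = defaultdict(float)  # running mean return per skill (bandit estimate)
    count = defaultdict(int)
    buffer = []
    for ep in range(episodes):
        state = 0
        for _ in range(horizon // skill_len):
            if ep < len(names):
                name = names[ep]                  # warm-up: one episode per skill
            elif rng.random() < eps:
                name = rng.choice(names)          # explore over skills
            else:
                name = max(names, key=lambda n: value[n])
            segment_return = 0.0
            for _ in range(skill_len):            # run the skill for a fixed duration
                action = SKILLS[name](state)
                next_state, reward = env_step(state, action)
                buffer.append((state, action, reward, next_state))
                segment_return += reward
                state = next_state
            count[name] += 1
            value[name] += (segment_return - value[name]) / count[name]
    return buffer


def learn_policy(buffer, sweeps=200, gamma=0.9, lr=0.5):
    """Learn a new policy off-policy from raw transitions; skills are not used here."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s2 in buffer:
            target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += lr * (target - q[(s, a)])
    return lambda s: max(ACTIONS, key=lambda a: q[(s, a)])


buffer = collect()
policy = learn_policy(buffer)
print([policy(s) for s in range(N_STATES)])
```

Because the final policy is fit to primitive-level transitions rather than to skill choices, it is not restricted to behaviors the skills can express. The skills only determine which parts of the state space the buffer covers.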
