Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .