The objective of lifelong reinforcement learning (RL) is to optimize agents that can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over them using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and yields agents that avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.
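To make the idea of planning in a space of skills with a dynamics model concrete, the following is a minimal toy sketch and not the LiSP algorithm itself. Everything in it is an assumption made for illustration: the one-dimensional state, the three discrete skills, the hand-written `skill_policy`, `dynamics_model`, and `reward_fn` (stand-ins for components that would be learned), and the exhaustive search over skill sequences (a practical planner would more likely use a sampling-based optimizer).

```python
from itertools import product

# Hypothetical stand-in for a learned skill-conditioned policy:
# each discrete skill index maps to a fixed 1-D action.
SKILL_ACTIONS = (-1.0, 0.0, 1.0)

def skill_policy(state, skill):
    return SKILL_ACTIONS[skill]

def dynamics_model(state, action):
    # Hypothetical stand-in for a learned dynamics model.
    return state + 0.1 * action

def reward_fn(state):
    # Toy reward (an assumption): stay close to the origin.
    return -abs(state)

def rollout_return(state, skill_sequence, steps_per_skill):
    """Simulate a skill sequence inside the model and sum the rewards."""
    total = 0.0
    for skill in skill_sequence:
        # Each skill is held for several primitive steps, so planning
        # happens at a coarser time scale than acting.
        for _ in range(steps_per_skill):
            state = dynamics_model(state, skill_policy(state, skill))
            total += reward_fn(state)
    return total

def plan_skill(state, horizon=4, steps_per_skill=3):
    """Return the first skill of the best sequence found in the model."""
    n_skills = len(SKILL_ACTIONS)
    # Exhaustive enumeration is feasible only because the toy problem is tiny.
    candidates = product(range(n_skills), repeat=horizon)
    best = max(candidates, key=lambda seq: rollout_return(state, seq, steps_per_skill))
    return best[0]
```

In a receding-horizon loop, the agent would execute only the returned skill for `steps_per_skill` steps in the real environment and then replan from the new state, which is what allows the plan to track a changing environment.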