Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every action is impractical for long-horizon tasks, akin to a human planning out every muscle movement; instead, humans plan efficiently with high-level skills to solve complex tasks. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the outcome of executing a skill rather than predicting every intermediate state step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse-reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency of both model-based RL and skill-based RL. Code and videos are available at \url{https://clvrai.com/skimo}.
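To make the idea of skill-space planning concrete, the sketch below shows one plausible way a skill dynamics model could be used for long-horizon planning. It is a minimal illustration assuming a CEM-style search over latent skills; the callables \texttt{skill\_dynamics}, \texttt{reward\_model}, and \texttt{value\_model} and all hyperparameters are hypothetical placeholders, not the actual SkiMo implementation.

\begin{verbatim}
import numpy as np

def plan_in_skill_space(state, skill_dynamics, reward_model, value_model,
                        horizon=5, num_candidates=256, num_iters=5,
                        num_elites=32, skill_dim=10):
    """CEM-style search over sequences of latent skills (illustrative sketch).

    Instead of querying a single-step dynamics model for every primitive
    action, each call to `skill_dynamics` jumps directly to the predicted
    state after executing one skill, so a short plan covers a long horizon.
    """
    mean = np.zeros((horizon, skill_dim))
    std = np.ones((horizon, skill_dim))

    for _ in range(num_iters):
        # Sample candidate skill sequences from the current search distribution.
        noise = np.random.randn(num_candidates, horizon, skill_dim)
        candidates = mean + std * noise

        returns = np.zeros(num_candidates)
        for i, skills in enumerate(candidates):
            s, total = state, 0.0
            for z in skills:
                total += reward_model(s, z)      # predicted reward of executing skill z
                s = skill_dynamics(s, z)         # predicted state after the whole skill
            returns[i] = total + value_model(s)  # bootstrap beyond the planning horizon

        # Refit the search distribution to the best-scoring skill sequences.
        elites = candidates[np.argsort(returns)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6

    # Return the first skill of the best plan; a low-level skill decoder
    # (not shown) would translate it into primitive actions.
    return mean[0]
\end{verbatim}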