Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every single action for long-horizon tasks is impractical, akin to a human planning out every muscle movement. Instead, humans plan efficiently with high-level skills to solve complex tasks. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts skill outcomes rather than predicting every detail of the intermediate states, step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse-reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency of both model-based RL and skill-based RL. Code and videos are available at https://clvrai.com/skimo.
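To make the core idea concrete, the following is a minimal, hypothetical sketch of skill-space planning with a learned skill dynamics model: a single network call predicts the latent state after an entire skill is executed, and a CEM-style planner searches over skill sequences in imagination. All class and function names, dimensions, and planner details here are illustrative assumptions, not SkiMo's actual implementation.

```python
# Minimal sketch (assumed names/shapes), not SkiMo's actual API.
import torch
import torch.nn as nn


class SkillDynamics(nn.Module):
    """Predicts the state after executing a whole skill (many low-level steps)."""

    def __init__(self, state_dim, skill_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, skill):
        # One forward pass jumps over the full skill horizon,
        # instead of predicting each intermediate step.
        return self.net(torch.cat([state, skill], dim=-1))


@torch.no_grad()
def plan_skills(dynamics, reward_fn, state, skill_dim, horizon=5,
                n_samples=256, n_iters=5, n_elite=32):
    """CEM-style planning over a sequence of skills in imagination."""
    mean = torch.zeros(horizon, skill_dim)
    std = torch.ones(horizon, skill_dim)
    for _ in range(n_iters):
        # Sample candidate skill sequences from the current search distribution.
        skills = mean + std * torch.randn(n_samples, horizon, skill_dim)
        returns = torch.zeros(n_samples)
        s = state.unsqueeze(0).expand(n_samples, -1)
        for t in range(horizon):
            s = dynamics(s, skills[:, t])   # imagined skill-level rollout
            returns += reward_fn(s)         # score the predicted skill outcome
        # Refit the distribution to the best-scoring skill sequences.
        elite = skills[returns.topk(n_elite).indices]
        mean, std = elite.mean(0), elite.std(0) + 1e-6
    return mean[0]  # execute the first skill, then re-plan


# Example usage with toy dimensions and a toy reward (purely illustrative).
dyn = SkillDynamics(state_dim=32, skill_dim=10)
reward = lambda s: -s.pow(2).sum(-1)  # drive the predicted state toward zero
first_skill = plan_skills(dyn, reward, torch.zeros(32), skill_dim=10)
```

Because each model call covers a full skill rather than a single environment step, a short planning horizon in skill space corresponds to a long horizon in the original action space, which is what makes long-term planning tractable.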