Offline reinforcement learning (RL) has shown promise in many safety-critical robotics tasks where exploration is risky and expensive. However, it still struggles to acquire skills in temporally extended tasks. In this paper, we study the problem of offline RL for temporally extended tasks. We propose a hierarchical planning framework consisting of a low-level goal-conditioned RL policy and a high-level goal planner. The low-level policy is trained via offline RL. We improve the offline training with a perturbed goal sampling process so that the policy can handle out-of-distribution goals. The high-level planner selects intermediate sub-goals by taking advantage of model-based planning methods. It plans over future sub-goal sequences based on the learned value function of the low-level policy. We adopt a Conditional Variational Autoencoder to sample meaningful high-dimensional sub-goal candidates and to solve the high-level long-term strategy optimization problem. We evaluate our proposed method in long-horizon driving and robot navigation tasks. Experiments show that our method outperforms baselines with different hierarchical designs, as well as regular planners without hierarchy, in these complex tasks.
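The idea of planning over sub-goal sequences with the low-level value function can be illustrated with a minimal sketch. It is not the method of this paper. The abstract does not specify the optimizer, so simple random shooting is swapped in here. The learned value function is replaced by a toy negative Euclidean distance, the Conditional Variational Autoencoder is replaced by noisy samples around the straight line from state to goal, and scoring a sequence by summing the values of consecutive segments is an assumption. All function names and parameters are hypothetical.

```python
import math
import random


def toy_value(start, goal):
    """Toy stand-in for a learned goal-conditioned value: closer goals score higher."""
    return -math.dist(start, goal)


def sample_subgoal_sequence(rng, state, goal, horizon, noise):
    """Toy stand-in for a CVAE sampler: noisy points along the state-to-goal line."""
    sequence = []
    for k in range(1, horizon + 1):
        frac = k / (horizon + 1)
        sequence.append(
            tuple(s + frac * (g - s) + rng.gauss(0.0, noise)
                  for s, g in zip(state, goal))
        )
    return sequence


def score_sequence(state, subgoals, goal, value_fn):
    """Assumed objective: sum of low-level values over consecutive waypoints."""
    waypoints = [state, *subgoals, goal]
    return sum(value_fn(a, b) for a, b in zip(waypoints, waypoints[1:]))


def plan_subgoals(state, goal, value_fn, num_candidates=256, horizon=3,
                  noise=0.5, seed=0):
    """Random-shooting planner: sample candidate sequences, keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [
        sample_subgoal_sequence(rng, state, goal, horizon, noise)
        for _ in range(num_candidates)
    ]
    scores = [score_sequence(state, c, goal, value_fn) for c in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best], scores


if __name__ == "__main__":
    subgoals, best_score, _ = plan_subgoals((0.0, 0.0), (4.0, 0.0), toy_value)
    print("selected sub-goals:", subgoals)
    print("score:", best_score)
```

In a hierarchical setup of this kind, the first selected sub-goal would be passed to the low-level goal-conditioned policy, and the plan would typically be recomputed as the agent moves. With the toy distance-based value, the best candidate is simply the one closest to the straight line; a learned value function would instead reflect obstacles and dynamics.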