Self-Imitation Learning by Planning

Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, and it is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. In this work, we address this problem with self-imitation learning by planning (SILP), an approach in which demonstration data are collected automatically by planning on the states visited by the current policy. SILP is inspired by the observation that the states successfully visited in the early stage of reinforcement learning are collision-free nodes for a graph-search-based motion planner, so we can plan over them and relabel the robot's own trials as demonstrations for policy learning. Because these demonstrations are self-generated, the human operator is relieved of the laborious data preparation process that IL and RL methods require for complex motion planning tasks. The evaluation results show that SILP achieves higher success rates and better sample efficiency than the selected baselines, and that the policy learned in simulation performs well in a real-world placement task with changing goals and obstacles.
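The core idea, planning over the policy's own visited states and relabeling the result as a demonstration, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the planner, the state space, or the relabeling format, so the function names `plan_demonstration` and `relabel`, the `edge_is_free` collision callback, the `max_step` connection radius, and the use of Dijkstra's algorithm over Euclidean distances are all assumptions made for illustration.

```python
import heapq
import math


def plan_demonstration(visited, edge_is_free, max_step=1.0):
    """Shortest path over visited states, from visited[0] to visited[-1].

    Hypothetical sketch: visited states are treated as collision-free graph
    nodes; two nodes are connected when they are within `max_step` of each
    other and the assumed callback `edge_is_free(a, b)` returns True.
    Returns the path as a list of states, or [] if the goal is unreachable.
    """
    n = len(visited)
    if n == 0:
        return []
    dist = [math.inf] * n
    parent = [None] * n
    dist[0] = 0.0
    queue = [(0.0, 0)]
    while queue:
        d, i = heapq.heappop(queue)
        if d > dist[i]:
            continue  # stale queue entry
        if i == n - 1:
            break  # reached the last visited state
        for j in range(n):
            if j == i:
                continue
            step = math.dist(visited[i], visited[j])
            if step > max_step or not edge_is_free(visited[i], visited[j]):
                continue
            if d + step < dist[j]:
                dist[j] = d + step
                parent[j] = i
                heapq.heappush(queue, (dist[j], j))
    if math.isinf(dist[n - 1]):
        return []
    # Walk parent pointers back from the goal node to the start node.
    path, k = [], n - 1
    while k is not None:
        path.append(visited[k])
        k = parent[k]
    path.reverse()
    return path


def relabel(path):
    """Turn a planned path into (state, action, next_state, goal) tuples.

    Assumption: the action is the displacement between consecutive states,
    and the goal is the last state on the path.
    """
    if not path:
        return []
    goal = path[-1]
    return [
        (s, tuple(b - a for a, b in zip(s, s_next)), s_next, goal)
        for s, s_next in zip(path, path[1:])
    ]


# A meandering trial in a 2D toy state space with no obstacles.
trial = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2)]
demo = relabel(plan_demonstration(trial, lambda a, b: True))
```

In this toy case the planner shortcuts the detour through `(1, 0)` and `(1, 1)`, so the relabeled demonstration is shorter than the original trial. A real system would replace the Euclidean toy states with robot configurations and supply an actual collision checker for `edge_is_free`.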