In reinforcement learning applications such as robotics, agents often have to handle heterogeneous input/output features, since their developers or physical constraints specify different state/action spaces. This forces unnecessary re-training from scratch and causes considerable sample inefficiency, especially when the agents follow similar solution steps to accomplish their tasks. In this paper, we aim to transfer such shared high-level goal-transition knowledge to alleviate this challenge. Specifically, we propose PILoT, i.e., Planning Immediate Landmarks of Targets. PILoT uses universal decoupled policy optimization to learn a goal-conditioned state planner and then distills it into a goal-planner that plans immediate landmarks in a model-free style and can be shared among different agents. In our experiments, we show the power of PILoT on various transfer challenges, including few-shot transfer across action spaces and dynamics, from low-dimensional vector states to image inputs, and from a simple robot to a complicated morphology; we also illustrate a zero-shot transfer solution from a simple 2D navigation task to the much harder Ant-Maze task.
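To make the two-stage pipeline concrete, the sketch below gives one possible reading of the idea: a goal-conditioned landmark planner distilled from the source agent's state planner, then reused to guide a target agent with a different action space via reward shaping in the shared state space. All names (`LandmarkPlanner`, `distill_landmark_planner`, `landmark_shaped_reward`) and the simplified regression and shaping objectives are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the two-stage PILoT idea. The class/function names and the
# simplified objectives below are illustrative stand-ins, not the paper's
# decoupled policy optimization or distillation losses.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class LandmarkPlanner(nn.Module):
    """Maps (state, goal) to the next landmark state; action-free, so it can
    be shared among agents with different action spaces."""

    def __init__(self, state_dim, goal_dim):
        super().__init__()
        self.net = mlp(state_dim + goal_dim, state_dim)

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))


def distill_landmark_planner(planner, dataset, epochs=10, lr=3e-4):
    """Supervised distillation: regress landmark targets produced by the
    source agent's goal-conditioned state planner, given as
    (state, goal, landmark) tensors."""
    opt = torch.optim.Adam(planner.parameters(), lr=lr)
    for _ in range(epochs):
        for state, goal, landmark in dataset:
            loss = ((planner(state, goal) - landmark) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return planner


def landmark_shaped_reward(planner, state, goal, next_state):
    """Dense shaping for a *target* agent: reward progress toward the
    landmark planned in the shared state space."""
    with torch.no_grad():
        landmark = planner(state, goal)
    return -(next_state - landmark).norm(dim=-1)
```

Because the planner operates purely on states and goals, the same distilled module can, under this reading, be reused for few-shot or zero-shot transfer by plugging its shaped reward into whatever low-level policy the target agent trains.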