While modern policy optimization methods can perform complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals. Task and motion planning (TAMP) methods, on the other hand, scale to long horizons but are computationally expensive and require precise tracking of world state. We propose a method that draws on the strengths of both: we train a policy to imitate the output of a TAMP solver. This produces a feed-forward policy that can accomplish multi-step tasks from sensory data. First, we build an asynchronous, distributed TAMP solver that can produce supervision data fast enough for imitation learning. Then, we propose a hierarchical policy architecture that lets us use partially trained control policies to speed up the TAMP solver. On robotic manipulation tasks with 7-DoF joint control, the partially trained policies reduce planning time by a factor of up to 2.6. Among these tasks, we learn a policy that solves the RoboSuite 4-object pick-place task 88% of the time from object pose observations, and a policy that solves the RoboDesk 9-goal benchmark 79% of the time from RGB images (averaged across the 9 disparate tasks).
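To make the core idea concrete, the following is a minimal, hypothetical behavior-cloning sketch, not the paper's implementation: a feed-forward policy is fit with a regression loss to (observation, action) pairs of the kind a TAMP solver could supervise. The `FeedForwardPolicy` and `imitate` names, the network sizes, and the synthetic data are all illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): behavior cloning on
# (observation, action) pairs standing in for TAMP-generated supervision.
import torch
import torch.nn as nn


class FeedForwardPolicy(nn.Module):
    """Maps low-dimensional observations (e.g. object poses) to joint commands."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def imitate(policy: FeedForwardPolicy, loader, epochs: int = 10, lr: float = 1e-3):
    """Fit the policy to planner-generated (obs, action) batches with an MSE loss."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy


if __name__ == "__main__":
    # Placeholder data: random observations and 7-DoF joint-command targets,
    # standing in for trajectories produced by a TAMP solver.
    obs = torch.randn(1024, 20)   # e.g. stacked object poses
    act = torch.randn(1024, 7)    # 7-DoF joint commands
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(obs, act), batch_size=64, shuffle=True)
    policy = imitate(FeedForwardPolicy(obs_dim=20, act_dim=7), loader)
```

In this simplified view, the same trained policy could then be queried inside the planner to warm-start search, which is the role the partially trained control policies play in speeding up the TAMP solver.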