Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments. To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve. Our approach is based on two key ideas. First, we decompose the goal-reaching problem hierarchically: a high-level planner uses conditional subgoal generators in the latent space to set intermediate subgoals for a low-level model-free policy. Second, we propose a hybrid approach that first pre-trains both the conditional subgoal generator and the policy on previously collected data through offline reinforcement learning, and then fine-tunes the policy via online exploration. This fine-tuning process is itself facilitated by the planned subgoals, which break down the original target task into short-horizon goal-reaching tasks that are significantly easier to learn. We conduct experiments both in simulation and in the real world, in which the policy is pre-trained on demonstrations of short primitive behaviors and fine-tuned for temporally extended tasks that are unseen in the offline data. Our experimental results show that PTP can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.
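
The hierarchical decomposition described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not specify the planning algorithm, so the sketch assumes a simple random-shooting planner, a 2D stand-in for the latent space, and hand-written functions in place of the learned components. All names (`sample_subgoal`, `plan_subgoals`, `low_level_policy`, `rollout`) and all parameter values are hypothetical, and the offline pre-training and online fine-tuning stages are omitted.

```python
import math
import random


def sample_subgoal(z, rng, max_step=1.0):
    """Hypothetical stand-in for a learned conditional subgoal generator.

    Samples a latent point within distance max_step of the current latent z.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.uniform(0.0, max_step)
    return (z[0] + radius * math.cos(angle), z[1] + radius * math.sin(angle))


def plan_subgoals(z0, z_goal, num_subgoals=4, num_candidates=256, seed=0):
    """Random-shooting planner (an assumption, not the paper's planner).

    Samples candidate chains of subgoals, each conditioned on the previous
    one, and keeps the chain whose last subgoal lands closest to the goal.
    """
    rng = random.Random(seed)
    best_plan, best_cost = None, float("inf")
    for _ in range(num_candidates):
        z, plan = z0, []
        for _ in range(num_subgoals):
            z = sample_subgoal(z, rng)
            plan.append(z)
        cost = math.dist(plan[-1], z_goal)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


def low_level_policy(z, subgoal, step=0.25):
    """Toy goal-conditioned policy: move toward the subgoal by at most step."""
    d = math.dist(z, subgoal)
    if d <= step:
        return subgoal
    t = step / d
    return (z[0] + t * (subgoal[0] - z[0]), z[1] + t * (subgoal[1] - z[1]))


def rollout(z0, plan, steps_per_subgoal=8):
    """Chase each planned subgoal in turn with the low-level policy."""
    z = z0
    for subgoal in plan:
        for _ in range(steps_per_subgoal):
            z = low_level_policy(z, subgoal)
    return z


start, goal = (0.0, 0.0), (3.0, 0.0)
plan, cost = plan_subgoals(start, goal)
final = rollout(start, plan)
print("distance from last subgoal to goal:", round(cost, 3))
print("final latent state:", final)
```

Each subgoal is at most one unit from the previous one, so the toy policy only ever faces short-horizon reaching problems, which mirrors the role the planned subgoals play in the abstract. In the method itself the generator and the policy are learned rather than hand-coded.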
