Humans decompose novel complex tasks into simpler ones to exploit previously learned skills. Analogously, hierarchical reinforcement learning seeks to leverage lower-level policies for simple tasks to solve complex ones. However, because each lower-level policy induces a different distribution of states, transitioning from one lower-level policy to another may fail due to an unexpected starting state. We introduce transition policies that smoothly connect lower-level policies by producing a distribution of states and actions that matches what is expected by the next policy. Training transition policies is challenging because the natural reward signal -- whether the next policy can execute its subtask successfully -- is sparse. By training transition policies via adversarial inverse reinforcement learning to match the distribution of expected states and actions, we avoid relying on task-based reward. To further improve performance, we use deep Q-learning with a binary action space to determine when to switch from a transition policy to the next pre-trained policy, using the success or failure of the next subtask as the reward. Although the reward is still sparse, the problem is less severe due to the simple binary action space. We demonstrate our method on continuous bipedal locomotion and arm manipulation tasks that require diverse skills. We show that it smoothly connects the lower-level policies, achieving higher success rates than previous methods that search for successful trajectories based on a reward function, but do not match the state distribution.
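To make the distribution-matching idea concrete, a minimal sketch of one standard adversarial formulation is given below. The abstract does not specify the exact discriminator parameterization or surrogate reward, so the symbols $D_\phi$, $\rho_{\pi_{i+1}}$, and $\rho_{\pi_T}$ are illustrative assumptions in a generic GAIL/AIRL-style objective, not the paper's exact loss:
\[
\min_{\pi_T} \; \max_{D_\phi} \;
\mathbb{E}_{(s,a)\sim\rho_{\pi_{i+1}}}\!\left[\log D_\phi(s,a)\right]
+ \mathbb{E}_{(s,a)\sim\rho_{\pi_T}}\!\left[\log\bigl(1 - D_\phi(s,a)\bigr)\right],
\]
where $\rho_{\pi_{i+1}}$ denotes the state-action distribution expected at the start of the next pre-trained policy and $\rho_{\pi_T}$ the distribution induced by the transition policy $\pi_T$. Under this sketch, the transition policy would be trained with a dense surrogate reward such as $r(s,a) = -\log\bigl(1 - D_\phi(s,a)\bigr)$ in place of the sparse task reward, while the separate switching decision is handled by Q-learning over the binary action set $\{\text{continue}, \text{switch}\}$ with subtask success as its reward, as described above.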