We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse-reward manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.
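To make the three-stage structure concrete, the following is a minimal, illustrative Python sketch of the pipeline described above: pre-trained atomic goal-conditioned policies (stage 1), a learned self-model that predicts the effect of running an atomic policy (stage 2), and planning over compositions of atomic policies entirely "in imagination" (stage 3). All class and function names (GoalConditionedPolicy, SelfModel, plan_program, the policy names, the toy reward) are hypothetical placeholders introduced for illustration, not the authors' code, and the search shown is a simple depth-limited enumeration standing in for the AlphaZero-style MCTS used by AlphaNPI.

```python
import random
from dataclasses import dataclass

# --- Stage 1: atomic goal-conditioned policies. Assumed to be trained
#     off-policy with experience replay; the training loop is omitted. ---
@dataclass
class GoalConditionedPolicy:
    name: str

    def act(self, observation, goal):
        # Placeholder: a trained policy would map (observation, goal) to a
        # continuous control action.
        return [random.uniform(-1.0, 1.0)]

# --- Stage 2: a self-model predicting the effect of executing an atomic
#     policy to completion, so that no environment interaction is needed
#     when composing policies later. ---
@dataclass
class SelfModel:
    def predict_next_state(self, state, policy_name):
        # Placeholder: a learned transition model s' ~ f(s, policy).
        return tuple(x + random.gauss(0.0, 0.01) for x in state)

# --- Stage 3: plan a compositional program "by imagination" by searching
#     over sequences of atomic policies using only the self-model.
#     A real implementation would use AlphaZero-style MCTS guided by a
#     neural programmer-interpreter core; here a depth-limited exhaustive
#     search keeps the sketch short. ---
def plan_program(state, atomic_names, self_model, reward_fn, depth):
    if depth == 0:
        return [], reward_fn(state)
    best_seq, best_ret = [], reward_fn(state)
    for name in atomic_names:
        imagined = self_model.predict_next_state(state, name)
        seq, ret = plan_program(imagined, atomic_names, self_model, reward_fn, depth - 1)
        if ret > best_ret:
            best_seq, best_ret = [name] + seq, ret
    return best_seq, best_ret

if __name__ == "__main__":
    policies = [GoalConditionedPolicy(n) for n in ("reach", "grasp", "lift", "place")]
    model = SelfModel()
    # Toy reward: prefer states with a large first coordinate
    # (a stand-in for "block successfully stacked").
    reward = lambda s: s[0]
    program, value = plan_program((0.0, 0.0, 0.0),
                                  [p.name for p in policies],
                                  model, reward, depth=3)
    print("imagined program:", program, "imagined return:", round(value, 3))
```

The point of the sketch is only the interface: higher-level programs are discovered by querying the self-model rather than the environment, which is what makes the third stage of learning interaction-free.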