This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated over many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.