Solving complex problems using reinforcement learning necessitates breaking the problem down into manageable tasks, either explicitly or implicitly, and learning policies to solve these tasks. These policies, in turn, have to be controlled by a master policy that makes high-level decisions. This requires a training algorithm that takes this hierarchical decision structure into account when learning these policies. However, training such methods in practice can lead to poor generalization, with sub-policies either executing actions for too few time steps or collapsing into a single policy altogether. In our work, we introduce an alternative approach that learns such skills sequentially without using an overarching hierarchical policy. We propose this method in the context of environments in which a major component of the agent's objective is to prolong the episode for as long as possible. We refer to our proposed method as Sequential Soft Option Critic. We demonstrate the utility of our approach on navigation and goal-based tasks in a flexible simulated 3D navigation environment that we have developed. We also show that our method outperforms prior methods such as Soft Actor-Critic and Soft Option Critic in our environment, as well as in the Gym-Duckietown self-driving car simulator and the Atari River Raid environment.