Solving complex problems with reinforcement learning necessitates breaking the problem down into manageable tasks and learning policies that solve those tasks. These policies, in turn, must be controlled by a master policy that makes high-level decisions, so learning such policies involves hierarchical decision structures. However, training such methods in practice can lead to poor generalization, with sub-policies either executing actions for too few time steps or devolving into a single policy altogether. In our work, we introduce an alternative approach that learns such skills sequentially without an overarching hierarchical policy. We propose this method in the context of environments where a major component of the learning agent's objective is to prolong the episode for as long as possible. We refer to the proposed method as Sequential Soft Option Critic. We demonstrate the utility of our approach on navigation and goal-based tasks in a flexible simulated 3D navigation environment that we have developed. We also show that our method outperforms prior methods such as Soft Actor-Critic and Soft Option Critic on various environments, including the Atari River Raid environment and the Gym-Duckietown self-driving car simulator.