Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). One approach to tackling this problem is to select actions according to a specific policy for an extended period of time; such temporally extended behaviors are known as options. A recent line of work derives such exploratory options from the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.
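
To make concrete the tabular setting the abstract argues does not scale, the sketch below illustrates the classic Laplacian-based option pipeline: build the full graph Laplacian from state adjacencies, eigendecompose it, and expose the eigenvectors as intrinsic reward functions for option policies. This is a minimal illustrative assumption, not the paper's algorithm; the function name `eigenoption_rewards` and the reward form r_i(s, s') = e_i(s') - e_i(s) are chosen here for exposition.

```python
import numpy as np

def eigenoption_rewards(adjacency, k=4):
    """Minimal tabular sketch (assumed setup, not the paper's online method):
    form the combinatorial graph Laplacian, eigendecompose it in full, and
    return intrinsic rewards r_i(s, s') = e_i(s') - e_i(s) that an option
    policy would maximize."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency                # L = D - A, requires the full graph
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # dense eigendecomposition: O(|S|^3)
    basis = eigvecs[:, 1:k + 1]                   # drop the constant eigenvector

    def reward(i, s, s_next):
        # change in the i-th eigenfunction along the transition (s -> s_next)
        return basis[s_next, i] - basis[s, i]

    return reward
```

The three assumptions listed in the abstract show up directly in this sketch: `adjacency` must be known or fully estimated, `np.linalg.eigh` must be tractable over all states, and the resulting rewards are defined per enumerated state, which is exactly what breaks down in pixel-based domains.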