Exploration is one of the most important tasks in Reinforcement Learning, but it is not well defined beyond finite problems in the Dynamic Programming paradigm (see Subsection 2.4). We provide a reinterpretation of exploration that can be applied to any online learning method. We arrive at this definition by approaching exploration from a new direction. After finding that the concepts of exploration created to solve simple Markov decision processes with Dynamic Programming are no longer broadly applicable, we reexamine exploration. Instead of extending the ends of dynamic exploration procedures, we extend their means: rather than repeatedly sampling every state-action pair possible in a process, we define the act of modifying an agent to itself be explorative. The resulting definition of exploration applies to infinite problems and non-dynamic learning methods, which the dynamic notion of exploration cannot accommodate. To understand how modifications of an agent affect learning, we describe a novel structure on the set of agents: a collection of distances (see footnote 7) $\{d_a\}_{a \in A}$, which represent the perspectives of each agent possible in the process. Using these distances, we define a topology on the agent space and show that many important structures in Reinforcement Learning are well behaved under convergence in this topology.
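To make the role of these distances concrete, here is a minimal sketch, assuming only that each $d_a$ is a pseudometric on the set of agents $A$ (the precise construction of the $d_a$ and of the agent-space topology is given later in the text): a family of distances indexed by agents induces a topology in the standard way, under which convergence in the agent space amounts to convergence from every agent's perspective,
\[
  b_n \longrightarrow b \ \text{ in the agent space}
  \quad\iff\quad
  d_a(b_n, b) \longrightarrow 0 \ \text{ for every } a \in A ,
\]
with subbasic open sets of the form $B_{a,\varepsilon}(b) = \{\, c \in A : d_a(b, c) < \varepsilon \,\}$ for $a \in A$ and $\varepsilon > 0$.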