Directed exploration strategies in reinforcement learning are critical for learning an optimal policy with a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like $\epsilon$-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches -- because they summarize past interactions -- with computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy that leverages this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states and allows for reduced exploration in other states. We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies that use confidence estimates on action-values.
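As a rough sketch of the quantities involved (the notation here is illustrative only; the precise bound with context-dependent noise variance is derived in the body of the paper), LSTD estimates action-value weights from accumulated statistics, and an optimistic value adds a generic elliptical confidence width of the kind used in linear-bandit upper confidence bounds:
\[
\hat{A} = \sum_{t} \phi(s_t, a_t)\bigl(\phi(s_t, a_t) - \gamma\,\phi(s_{t+1}, a_{t+1})\bigr)^\top,
\qquad
\hat{b} = \sum_{t} \phi(s_t, a_t)\, r_{t+1},
\qquad
\hat{\theta} = \hat{A}^{-1}\hat{b},
\]
\[
\tilde{Q}(s, a) = \phi(s, a)^\top \hat{\theta}
\;+\; \beta \sqrt{\phi(s, a)^\top \hat{C}^{-1} \phi(s, a)},
\qquad
\hat{C} = \lambda I + \sum_t \phi(s_t, a_t)\,\phi(s_t, a_t)^\top,
\]
where $\phi$ is the state-action feature map, $\beta$ scales the exploration bonus, and $\lambda$ is a regularizer; $\beta$ and $\lambda$ are assumed hyperparameters in this sketch.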