We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that is based on a simple constrained optimization problem, which adaptively targets goal states that are neither too difficult nor too easy to reach according to the agent's current knowledge. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for all the goal states that are reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process. In the tabular case with $S$ states and $A$ actions, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, which yields the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, AdaGoal is anchored in the high-level algorithmic structure of existing methods for goal-conditioned deep reinforcement learning.
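As an illustration of the goal selection scheme described above, the following is a minimal sketch of how such a constrained optimization could look; the symbols $U_t$, $\widehat{D}_t$ and $\mathcal{G}$ are notational placeholders introduced here for illustration, not definitions taken from the paper. At round $t$, the agent could select
$$ g_t \in \operatorname*{arg\,max}_{g \in \mathcal{G}} \; U_t(g) \qquad \text{subject to} \qquad \widehat{D}_t(s_0, g) \le L, $$
where $U_t(g)$ estimates how far the current goal-conditioned policy for $g$ is from $\epsilon$-optimality and $\widehat{D}_t(s_0, g)$ is an (optimistic) estimate of the expected number of steps needed to reach $g$ from $s_0$. Under this reading, the constraint excludes goals that appear unreachable within $L$ steps (too hard), while maximizing $U_t$ steers exploration away from goals whose policies are already well learned (too easy).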