This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unaware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon non-stationary Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*} \frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to log factor)} \end{align*} without guidance from the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning stage leverages ideas from sample-optimal offline RL algorithms.
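To make the guarantee concrete, here is a minimal sketch of the $\varepsilon$-optimality criterion in the standard sense; the value-function notation $V_1^{\pi,r}$ and the initial state distribution $\rho$ are introduced purely for illustration and are not fixed by the abstract itself:
\begin{align*}
	\max_{\pi}\, V_1^{\pi, r}(\rho) \;-\; V_1^{\widehat{\pi}_r,\, r}(\rho) \;\le\; \varepsilon
	\qquad \text{for every given reward function } r,
\end{align*}
where $\widehat{\pi}_r$ denotes the policy returned for reward $r$ by the (reward-aware) planning stage, computed from the same reward-agnostic exploration data.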