In task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals; a reward function is then revealed, and the agent is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenario, in which no structural information about the reward or the transition dynamics is utilized. Consequently, the best known sample upper bound scales as $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, which can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3 S A / \rho + H^4 S^2 A)\big)$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, yielding a nearly \emph{quadratic saving} in terms of $\epsilon$. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that a $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandits) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.
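As an informal back-of-the-envelope illustration of the claimed saving (not part of the formal statement): treating $H$, $S$, $A$, and $\rho$ as constants and varying only $\epsilon$, the worst-case and gap-dependent bounds compare as
\[
\underbrace{\widetilde{\mathcal{O}}\!\left(\frac{1}{\epsilon^{2}}\right)}_{\text{worst case}}
\quad\text{vs.}\quad
\underbrace{\widetilde{\mathcal{O}}\!\left(\frac{1}{\epsilon}\right)}_{\text{gap-dependent}},
\]
so that, for example, at $\epsilon = 10^{-2}$ the $\epsilon$-dependent factor drops from order $10^{4}$ to order $10^{2}$; this is the sense in which the saving is nearly quadratic in $\epsilon$.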