Achieving sample efficiency in reinforcement learning (RL) necessitates efficiently exploring the underlying environment. In the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by this challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes a policy via a planning algorithm using the offline data collected in the exploration phase. Moreover, we tackle this problem in the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Furthermore, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games, and prove that our methods achieve $\widetilde{\mathcal{O}}(1/\varepsilon^2)$ sample complexity for generating an $\varepsilon$-suboptimal policy or an $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithms with kernel and neural function approximators.
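To make the exploration mechanism concrete, the following is a minimal sketch of the optimistic kernel value-iteration update, assuming a kernel ridge regression estimator with kernel $k$, regularization parameter $\lambda > 0$, and scaling factor $\beta > 0$; the display is illustrative, with the exact bonus constants and the neural counterpart deferred to the main text:
\begin{align*}
u_h(z) &= \beta\,\lambda^{-1/2}\Big[k(z,z) - k_h(z)^{\top}\big(K_h + \lambda I\big)^{-1} k_h(z)\Big]^{1/2},\\
Q_h(z) &= \min\Big\{r_h(z) + \big(\widehat{\mathbb{P}}_h V_{h+1}\big)(z) + u_h(z),\, H\Big\}, \qquad V_h(s) = \max_{a} Q_h(s,a),
\end{align*}
where $z = (s,a)$, $K_h$ is the Gram matrix over previously collected state-action pairs at step $h$, $k_h(z)$ collects their kernel evaluations at $z$, and $\widehat{\mathbb{P}}_h V_{h+1}$ is the regression estimate of the expected next-step value. During the reward-free exploration phase, the extrinsic reward $r_h$ is replaced by an intrinsic reward proportional to the bonus (e.g., $u_h/H$), which drives the agent toward poorly covered regions; in the planning phase, the same optimistic update is run on the collected offline data with the given extrinsic reward.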