We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample-efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b), and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not statistically necessary for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between the low-rank and linear completeness settings.