Most existing work on reinforcement learning (RL) with general function approximation (FA) focuses on understanding statistical complexity or regret bounds. The computational complexity of such approaches, however, is far less well understood -- indeed, even a simple optimization problem over the function class may be intractable. In this paper, we tackle this problem by establishing an efficient online sub-sampling framework that measures the information gain of the data points collected by an RL algorithm and uses this measurement to guide exploration. For a value-based method with a complexity-bounded function class, we show that the policy needs to be updated only $\operatorname{poly}\log(K)$ times over $K$ episodes of the RL algorithm while still achieving a near-optimal regret bound. In contrast to existing approaches, which update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls required to solve for a policy. When applied to the settings of \cite{wang2020reinforcement} or \cite{jin2021bellman}, our framework improves the overall time complexity by at least a factor of $K$. Finally, we demonstrate the generality of our online sub-sampling technique by applying it to the reward-free RL and multi-agent RL settings.
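To illustrate the low-switching idea behind the $\operatorname{poly}\log(K)$ update count, the sketch below uses a standard determinant-doubling heuristic for the special case of linear features: the policy is re-solved only when the log-determinant of the accumulated feature covariance has grown by a constant, so the number of updates scales with the information gained rather than with $K$. This is a minimal illustration, not the paper's actual sub-sampling procedure (which handles general function classes); the names `run_episodes` and `threshold` are hypothetical.

```python
import numpy as np

def run_episodes(features, threshold=2.0):
    """Count policy updates under a determinant-doubling rule.

    features: (K, d) array, one feature vector per episode.
    A policy update is triggered whenever det(cov) has grown by
    a factor of `threshold` since the last update.
    """
    d = features.shape[1]
    cov = np.eye(d)          # regularized feature covariance
    last_logdet = 0.0        # logdet of the identity
    switches = 0
    for phi in features:
        cov += np.outer(phi, phi)              # accumulate information
        _, logdet = np.linalg.slogdet(cov)
        if logdet - last_logdet > np.log(threshold):
            switches += 1                      # re-solve the policy here
            last_logdet = logdet
    return switches
```

Because the log-determinant can grow only logarithmically in the number of episodes (for bounded features), the number of triggered updates is polylogarithmic in $K$, while an algorithm that re-solves every episode would pay $K$ optimization calls.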