Designing provably efficient algorithms with general function approximation is an important open problem in reinforcement learning. Recently, Wang et al.~[2020c] proposed a value-based algorithm with general function approximation that enjoys an $\widetilde{O}(\mathrm{poly}(dH)\sqrt{K})$\footnote{Throughout the paper, we use $\widetilde{O}(\cdot)$ to suppress logarithmic factors.} regret bound, where $d$ depends on the complexity of the function class, $H$ is the planning horizon, and $K$ is the total number of episodes. However, their algorithm requires $\Omega(K)$ computation time per round, rendering it inefficient for practical use. In this paper, by applying online sub-sampling techniques, we develop an algorithm that takes $\widetilde{O}(\mathrm{poly}(dH))$ computation time per round on average and enjoys nearly the same regret bound. Furthermore, the algorithm achieves low switching cost, i.e., it changes the policy only $\widetilde{O}(\mathrm{poly}(dH))$ times during its execution, making it appealing for implementation in real-life scenarios. Moreover, by using an exploration-driven reward function based on upper confidence bounds, the algorithm provably explores the environment in the reward-free setting. In particular, after $\widetilde{O}(\mathrm{poly}(dH)/\epsilon^2)$ rounds of exploration, the algorithm outputs an $\epsilon$-optimal policy for any given reward function.
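To give intuition for how online sub-sampling can cut the per-round cost from $\Omega(K)$ to $\widetilde{O}(\mathrm{poly}(dH))$, the following is a minimal, illustrative sketch: each new observation is kept only with probability proportional to a sensitivity score measured against the current sub-sampled core set, and the policy needs to be recomputed only when the core set changes. This sketch uses a hypothetical linear-feature proxy for the general function class, and the names \texttt{OnlineSubSampler}, \texttt{oversample}, and the leverage-score-style sensitivity are illustrative assumptions rather than the paper's exact procedure.
\begin{verbatim}
# Illustrative sketch of sensitivity-based online sub-sampling
# (hypothetical linear-feature proxy for a general function class).
import numpy as np

class OnlineSubSampler:
    """Maintains a small weighted core set of feature vectors so that
    per-round computations scale with the core-set size rather than
    with the total number of episodes K."""

    def __init__(self, dim, lam=1.0, oversample=1.0):
        self.lam = lam                    # ridge regularizer
        self.oversample = oversample      # trades core-set size vs. accuracy
        self.cov = lam * np.eye(dim)      # covariance of the sub-sampled data
        self.core = []                    # list of (feature, weight) pairs

    def sensitivity(self, phi):
        # Leverage-score-style sensitivity of the new point w.r.t. the core set.
        return float(phi @ np.linalg.solve(self.cov, phi))

    def observe(self, phi):
        # Keep the new point with probability proportional to its sensitivity,
        # re-weighting by 1/p so downstream estimates stay roughly unbiased.
        p = min(1.0, self.oversample * self.sensitivity(phi))
        if np.random.rand() < p:
            w = 1.0 / p
            self.core.append((phi, w))
            self.cov += w * np.outer(phi, phi)
            return True    # core set changed: recompute the policy (a "switch")
        return False       # core set unchanged: reuse the current policy
\end{verbatim}
Since low-sensitivity points are discarded, the core set (and hence the number of policy switches) grows only polylogarithmically with $K$ under standard assumptions, which is the mechanism behind both the computational saving and the low switching cost claimed above.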