We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards. We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning and bound its regret assuming completeness and bounded Eluder dimension for the regression function class. As a special case, VO$Q$L achieves $\tilde{O}(d\sqrt{HT}+d^6H^{5})$ regret over $T$ episodes for a horizon $H$ MDP under ($d$-dimensional) linear function approximation, which is asymptotically optimal. Our algorithm incorporates weighted regression-based upper and lower bounds on the optimal value function to obtain this improved regret. The algorithm is computationally efficient given a regression oracle over the function class, making this the first computationally tractable and statistically optimal approach for linear MDPs.
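As a rough illustration of the weighted-regression idea in the linear setting, the sketch below performs inverse-variance-weighted ridge regression over $d$-dimensional features, the core fitting step one would use to build the regression-based value estimates described above. The function name `variance_weighted_ridge`, its interface, and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def variance_weighted_ridge(phi, targets, sigma2, lam=1.0):
    """Inverse-variance-weighted ridge regression (illustrative sketch).

    phi:     (n, d) state-action features
    targets: (n,)   regression targets (e.g. reward plus next-state value estimate)
    sigma2:  (n,)   per-sample variance estimates (clipped away from zero)
    lam:     ridge regularization strength
    Returns the fitted weight vector theta_hat of shape (d,).
    """
    w = 1.0 / np.maximum(sigma2, 1e-6)            # inverse-variance weights
    Lambda = lam * np.eye(phi.shape[1])           # regularized weighted Gram matrix
    Lambda += (phi * w[:, None]).T @ phi
    b = (phi * w[:, None]).T @ targets
    return np.linalg.solve(Lambda, b)

# Illustrative usage: recover a linear Q-function from heteroscedastic noisy targets.
rng = np.random.default_rng(0)
n, d = 500, 8
phi = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma2 = rng.uniform(0.1, 2.0, size=n)            # assumed per-sample variances
targets = phi @ theta_true + rng.normal(scale=np.sqrt(sigma2))
theta_hat = variance_weighted_ridge(phi, targets, sigma2)
print(np.linalg.norm(theta_hat - theta_true))     # small fitting error
```

Down-weighting high-variance samples in this way is what tightens the confidence widths enough to obtain the $\tilde{O}(d\sqrt{HT})$ leading term in the linear case.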