Reinforcement learning (RL) with linear function approximation has received increasing attention recently. However, existing work has focused on obtaining $\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the MDP. In this paper, we show that logarithmic regret is attainable under two recently proposed linear MDP assumptions, provided that there exists a positive sub-optimality gap for the optimal action-value function. More specifically, under the linear MDP assumption (Jin et al. 2019), the LSVI-UCB algorithm can achieve $\tilde{O}(d^{3}H^5/\text{gap}_{\text{min}}\cdot \log(T))$ regret; and under the linear mixture MDP assumption (Ayoub et al. 2020), the UCRL-VTR algorithm can achieve $\tilde{O}(d^{2}H^5/\text{gap}_{\text{min}}\cdot \log^3(T))$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $\text{gap}_{\text{min}}$ is the minimal sub-optimality gap, and $\tilde O$ hides all logarithmic terms except $\log(T)$. To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation. We also establish gap-dependent lower bounds for the two linear MDP models.
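For reference, a sketch of the standard definition of the minimal sub-optimality gap used in gap-dependent analyses, stated under the assumed notation that $V_h^*$ and $Q_h^*$ denote the optimal value and action-value functions at step $h$:

% Minimal sub-optimality gap (standard definition; notation assumed here,
% not quoted verbatim from the paper).
\begin{align*}
\text{gap}_h(s,a) &:= V_h^*(s) - Q_h^*(s,a), \\
\text{gap}_{\text{min}} &:= \min_{(s,a,h)}\bigl\{\text{gap}_h(s,a) : \text{gap}_h(s,a) > 0\bigr\}.
\end{align*}

Under this convention, the positive-gap condition in the abstract amounts to requiring $\text{gap}_{\text{min}} > 0$, i.e., every sub-optimal action is worse than the optimal one by at least $\text{gap}_{\text{min}}$.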