We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited-adaptivity models, the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for linear Markov decision processes. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions, and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/(dH)}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity.
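As a quick sanity check (not part of the abstract itself, simply substituting the stated batch budget into the bound), choosing $B=\sqrt{T/(dH)}$ makes the $dHT/B$ term match the leading $\sqrt{d^3H^3T}$ term:
\[
  \left.\frac{dHT}{B}\right|_{B=\sqrt{T/(dH)}}
  \;=\; dHT\sqrt{\frac{dH}{T}}
  \;=\; \sqrt{d^3H^3T}.
\]
Similarly, with $B = dH\log T$ policy switches, the exponent in the rare-switch bound becomes $dH/B = 1/\log T$, so $[1+T/(dH)]^{1/\log T} \le e^{\log(1+T)/\log T} = O(1)$, and the bound reduces to $\tilde O(\sqrt{d^3H^3T})$.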