Multi-armed bandit (MAB) and stochastic linear bandit (SLB) are important models in reinforcement learning, and it is well-known that classical algorithms for bandits with time horizon $T$ suffer $\Omega(\sqrt{T})$ regret. In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with $O(\mbox{poly}(\log T))$ regret, an exponential improvement in the dependence on $T$. To the best of our knowledge, this is the first provable quantum speedup for the regret of bandit problems and, more generally, for exploitation in reinforcement learning. Compared to previous literature on quantum exploration algorithms for MAB and reinforcement learning, our quantum input model is simpler and assumes only a quantum oracle for each individual arm.
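As an illustrative sketch (not necessarily the paper's exact definition), a quantum reward oracle for an arm $a$ with reward distribution $P_a$ is commonly formalized as a unitary that prepares a coherent superposition over rewards,
\[
O_a \colon |0\rangle \mapsto \sum_{r} \sqrt{P_a(r)}\, |r\rangle,
\]
so that quantum mean-estimation subroutines can query $O_a$ in superposition; such subroutines estimate $\mathbb{E}[r]$ to accuracy $\epsilon$ with roughly $O(1/\epsilon)$ oracle calls rather than the $O(1/\epsilon^2)$ samples required classically, which is the standard source of regret speedups of this kind.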