In this paper, we study multi-armed bandits (MAB) and stochastic linear bandits (SLB) with heavy-tailed rewards and a quantum reward oracle. Unlike previous work on quantum bandits, which assumes bounded or sub-Gaussian reward distributions, here we investigate the quantum bandit problem under the weaker assumption that the reward distributions only have a bounded $(1+v)$-th moment for some $v\in (0,1]$. To achieve regret improvements for heavy-tailed bandits, we first propose a new quantum mean estimator for heavy-tailed distributions, which is based on the Quantum Monte Carlo Mean Estimator and achieves a quadratic improvement in estimation error over its classical counterpart. Building on this estimator, we then study quantum heavy-tailed MAB and SLB and propose quantum algorithms based on the Upper Confidence Bound (UCB) framework for both problems, achieving regrets of $\tilde{O}(T^{\frac{1-v}{1+v}})$, which polynomially improve the dependence on $T$ compared to the classical (near-)optimal regrets of $\tilde{O}(T^{\frac{1}{1+v}})$, where $T$ is the number of rounds. Finally, experiments support our theoretical results and demonstrate the effectiveness of our proposed methods.
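To make the claimed quadratic improvement concrete, the following comparison is a hedged reconstruction from the rates stated above (suppressing the dependence on the moment bound and logarithmic factors; it is not a display quoted from the paper). It contrasts the classical and quantum mean-estimation errors after $n$ samples or oracle queries, respectively, for a distribution with a bounded $(1+v)$-th moment:
\[
\big|\hat{\mu}_{\mathrm{classical}} - \mu\big| \;=\; \tilde{O}\!\big(n^{-\frac{v}{1+v}}\big)
\qquad \text{vs.} \qquad
\big|\hat{\mu}_{\mathrm{quantum}} - \mu\big| \;=\; \tilde{O}\!\big(n^{-\frac{2v}{1+v}}\big),
\]
where the classical rate is the standard one for heavy-tailed mean estimation (e.g., via truncated or Catoni-type estimators) and the quantum rate is its square, i.e., the quadratic improvement. After the usual UCB error balancing, this is consistent with the regret gap of $\tilde{O}(T^{\frac{1}{1+v}})$ versus $\tilde{O}(T^{\frac{1-v}{1+v}})$ stated above.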