Identifying the best arm of a multi-armed bandit is a central problem in bandit optimization. We study a quantum computational version of this problem with coherent oracle access to states encoding the reward probabilities of each arm as quantum amplitudes. Specifically, we show that we can find the best arm with fixed confidence using $\tilde{O}\bigl(\sqrt{\sum_{i=2}^n\Delta^{\smash{-2}}_i}\bigr)$ quantum queries, where $\Delta_{i}$ represents the difference between the mean reward of the best arm and the $i^\text{th}$-best arm. This algorithm, based on variable-time amplitude amplification and estimation, gives a quadratic speedup compared to the best possible classical result. We also prove a matching quantum lower bound (up to poly-logarithmic factors).
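To make the scaling concrete, the following minimal sketch computes the quantity $\sum_{i=2}^n \Delta_i^{-2}$ and its square root for a small set of arm means. It is purely classical arithmetic, not an implementation of the quantum algorithm; the function names `gaps` and `hardness` and the example means are hypothetical, and all constants and poly-logarithmic factors hidden by the $\tilde{O}$ notation are ignored.

```python
import math


def gaps(means):
    """Return Delta_i for i = 2..n: best mean minus the i-th best mean."""
    ordered = sorted(means, reverse=True)
    best = ordered[0]
    return [best - m for m in ordered[1:]]


def hardness(means):
    """Return sum_{i=2}^n Delta_i^{-2}; assumes a unique best arm."""
    deltas = gaps(means)
    if any(d <= 0 for d in deltas):
        raise ValueError("the best arm must be unique (all gaps positive)")
    return sum(d ** -2 for d in deltas)


# Illustrative (assumed) mean rewards for three arms.
example_means = [0.75, 0.5, 0.25]
H = hardness(example_means)

# Classical scaling is on the order of H (up to log factors);
# the quantum bound in the abstract scales as sqrt(H) (up to log factors).
classical_scale = H
quantum_scale = math.sqrt(H)
print(f"sum of inverse squared gaps: {H:.2f}")
print(f"square root (quantum query scaling): {quantum_scale:.2f}")
```

For these assumed means the gaps are 0.25 and 0.5, so the sum is 16 + 4 = 20 and its square root is about 4.47, which illustrates the quadratic gap between the two scalings.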