We consider a stochastic bandit problem with a possibly infinite number of arms. We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates both in the cumulative regret setting and in the best-arm identification setting, in terms of the problem parameters $T$ (the budget), $p^*$, and $\Delta$. For the objective of minimizing the cumulative regret, we provide a lower bound of order $\Omega(\log(T)/(p^*\Delta))$ and a UCB-style algorithm with a matching upper bound up to a factor of $\log(1/\Delta)$. Our algorithm needs $p^*$ to calibrate its parameters, and we prove that this knowledge is necessary, since adapting to $p^*$ in this setting is impossible. For best-arm identification we also provide a lower bound of order $\Omega(\exp(-cT\Delta^2 p^*))$ on the probability of outputting a sub-optimal arm, where $c>0$ is an absolute constant. We also provide an elimination algorithm with an upper bound matching the lower bound up to a factor of order $\log(T)$ in the exponential, and that does not need $p^*$ or $\Delta$ as parameters. Our results apply directly to the three related problems of competing against the $j$-th best arm, identifying an $\epsilon$-good arm, and finding an arm with mean larger than a quantile of a known order.
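As a minimal illustration of why $p^*$ calibrates the algorithm, consider the standard subsample-then-play idea for infinitely many arms: draw roughly $\log(1/\delta)/p^*$ arms so that, with probability at least $1-\delta$, the subsample contains an optimal arm, then run a finite-armed UCB on the subsample. The sketch below is a hypothetical simulation under assumed Bernoulli arms, not the paper's algorithm; the names `subsample_ucb`, `draw_arm`, and `pull` are illustrative.

```python
import math
import random


def subsample_ucb(draw_arm, pull, p_star, horizon, fail_prob=0.01):
    """Hypothetical sketch: draw K arms so that (1 - p_star)^K <= fail_prob,
    guaranteeing the subsample holds an optimal arm with probability
    >= 1 - fail_prob, then run UCB1 on that finite set."""
    K = max(1, math.ceil(math.log(1.0 / fail_prob) / p_star))
    arms = [draw_arm() for _ in range(K)]
    counts = [0] * K
    sums = [0.0] * K
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            i = t - 1  # initialization: pull each sampled arm once
        else:
            # UCB1 index: empirical mean plus exploration bonus
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = pull(arms[i])
        counts[i] += 1
        sums[i] += r
        total_reward += r
    return total_reward


# Usage: Bernoulli arms whose mean is 0.9 with probability p* = 0.2
# (optimal arms) and 0.5 otherwise, so Delta = 0.4.
random.seed(0)
draw = lambda: 0.9 if random.random() < 0.2 else 0.5
reward = subsample_ucb(draw,
                       lambda mu: 1.0 if random.random() < mu else 0.0,
                       p_star=0.2, horizon=5000)
```

Note that the subsample size $K$ depends explicitly on $p^*$; this is exactly the calibration step that, per the lower bound above, cannot be dispensed with in the regret setting.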