We consider a stochastic bandit problem with a possibly infinite number of arms. We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates in both the cumulative regret setting and the best-arm identification setting, in terms of the problem parameters $T$ (the budget), $p^*$, and $\Delta$. For the objective of minimizing the cumulative regret, we provide a lower bound of order $\Omega(\log(T)/(p^*\Delta))$ and a UCB-style algorithm with a matching upper bound up to a factor of $\log(1/\Delta)$. Our algorithm needs $p^*$ to calibrate its parameters, and we prove that this knowledge is necessary: adapting to $p^*$ in this setting is impossible. For best-arm identification we also provide a lower bound of order $\Omega(\exp(-cT\Delta^2 p^*))$ on the probability of outputting a sub-optimal arm, where $c>0$ is an absolute constant. We also provide an elimination algorithm whose upper bound matches the lower bound up to a factor of order $\log(1/\Delta)$ in the exponential, and which does not need $p^*$ or $\Delta$ as parameters.
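To illustrate why knowledge of $p^*$ enters the calibration, note that if $K$ arms are drawn uniformly from the reservoir, the probability that none of them is optimal is $(1-p^*)^K \le \exp(-K p^*)$, so $K$ of order $\log(T)/p^*$ sampled arms contain an optimal arm with probability at least $1-1/T$. The sketch below is a generic "subsample then run UCB1" baseline under this calibration; it is not the paper's algorithm, the constant $c$ and the function names (`subsample_ucb`, `draw_new_arm`, `pull`) are hypothetical, and this naive variant is not guaranteed to attain the stated $\log(T)/(p^*\Delta)$ rate.

```python
import math
import random

def subsample_ucb(draw_new_arm, pull, T, p_star, c=2.0):
    """Illustrative sketch (not the paper's algorithm).

    draw_new_arm() -> a fresh arm handle from the infinite reservoir
    pull(arm)      -> reward in [0, 1]
    p_star         -> assumed proportion of optimal arms, used to size the subsample
    """
    # Sample K ~ c*log(T)/p_star arms so the subsample contains an optimal
    # arm with high probability (assumed calibration, see lead-in).
    K = max(1, math.ceil(c * math.log(T) / p_star))
    arms = [draw_new_arm() for _ in range(K)]
    counts = [0] * K
    sums = [0.0] * K

    for t in range(1, T + 1):
        if t <= K:
            i = t - 1  # pull each sampled arm once
        else:
            # UCB1 index on the finite subsample
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = pull(arms[i])
        counts[i] += 1
        sums[i] += r

    # recommend the empirically best sampled arm (best-arm identification output)
    best = max(range(K), key=lambda j: sums[j] / max(counts[j], 1))
    return arms[best]

# Toy usage: an arm is optimal (mean 0.6) w.p. p_star, else has mean 0.6 - Delta.
p_star, Delta, T = 0.05, 0.2, 10_000
draw_new_arm = lambda: 0.6 if random.random() < p_star else 0.6 - Delta
pull = lambda mean: 1.0 if random.random() < mean else 0.0
recommended = subsample_ucb(draw_new_arm, pull, T, p_star)
```

The sketch only shows where $p^*$ is used: to decide how many arms to draw from the reservoir before standard finite-armed index policies take over.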