We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, when the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification, we show that our algorithm is $\delta$-PAC and we characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem, as we show when the gap approaches zero, best-arm identification is impossible. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support-size, and we characterize its sample complexity. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.
翻译:我们研究的是多武装强盗中具有随机性、潜在的私人奖赏的最好的武器识别问题,当目标是在固定的、规定的水平上辨别手臂与最高四分位数时。首先,我们提出一个(非私营的)连续消除算法,以严格优化最佳武器识别,我们显示我们的算法是$delta$-PAC,我们对其样本复杂性作了区分。此外,我们提供了一个关于预期拉动次数的较低界限,表明提议的算法基本上与对数因素相比是最佳的。高低的复杂界限都取决于相关次优化差距的特殊定义,特别是针对四分位断层问题,正如我们所显示的那样,当差距接近零时,最佳武器识别是不可能的。第二,由于各种应用的动机,当奖励是私人的时,我们提供了一种差别的私人连续消除算法,其样本复杂性即使是在无限支持规模的分布上也是有限的,我们对其样本复杂性加以定性。我们的算法不需要事先知道亚最佳差距或手顶问题的其他统计信息。