We study the Pareto frontier of two archetypal objectives in stochastic bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical for achieving optimal performance for the latter objective. To make this precise, we first design and analyze the BoBW-lil'UCB$({\gamma})$ algorithm, which achieves order-wise optimal performance for RM or BAI under different values of ${\gamma}$. Complementarily, we show that no algorithm can simultaneously perform optimally for both the RM and BAI objectives. More precisely, we establish non-trivial lower bounds on the regret achievable by any algorithm with a given BAI failure probability. This analysis shows that in some regimes BoBW-lil'UCB$({\gamma})$ achieves Pareto-optimality up to constant or small terms. Numerical experiments further demonstrate that, when applied to difficult instances, BoBW-lil'UCB outperforms a close competitor, UCB$_{\alpha}$ (Degenne et al., 2019), which is designed for RM and BAI with a fixed confidence.
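To illustrate the exploration/exploitation trade-off that the parameter ${\gamma}$ controls, the sketch below runs a generic UCB-style index policy whose confidence width is scaled by a factor `gamma`. This is a minimal illustrative sketch only, not the paper's BoBW-lil'UCB algorithm: the arm model (unit-variance Gaussian rewards), the index formula, and the recommendation rule (most-pulled arm) are all simplifying assumptions chosen for brevity.

```python
import math
import random


def ucb_gamma(means, horizon, gamma, seed=0):
    """Run a gamma-scaled UCB-style policy on unit-variance Gaussian arms.

    Illustrative sketch only (NOT the paper's BoBW-lil'UCB): a larger
    gamma widens the confidence bonus, pushing the policy toward
    exploration (helpful for BAI); a smaller gamma favors exploitation
    (helpful for regret minimization).

    Returns (pull counts per arm, recommended arm index).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialize its estimate
        else:
            # index = empirical mean + gamma-scaled confidence width
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + gamma * math.sqrt(2.0 * math.log(t + 1) / counts[i]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    # recommend the most-pulled arm (one common BAI output rule)
    return counts, max(range(k), key=counts.__getitem__)
```

Varying `gamma` traces out the qualitative trade-off the abstract describes: small values concentrate pulls on the empirically best arm (low regret, weaker identification guarantees), while large values spread pulls more evenly (higher regret, more reliable identification).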