In this paper, we introduce Ballooning Multi-Armed Bandits (BL-MAB), a novel extension of the classical stochastic MAB model. In the BL-MAB model, the set of available arms grows (or balloons) over time. In contrast to the classical MAB setting, where regret is computed with respect to the single best arm overall, regret in the BL-MAB setting is computed with respect to the best arm available at each time instant. We first observe that existing MAB algorithms are not regret-optimal for the BL-MAB model. We show that if the best arm is equally likely to arrive at any time, sub-linear regret cannot be achieved, irrespective of the arrivals of the other arms. We further show that if the best arm is more likely to arrive in the early rounds, sub-linear regret is achievable. Our proposed algorithm determines (1) the fraction of the time horizon during which newly arriving arms should be explored and (2) the sequence of arm pulls during the exploitation phase from among the explored arms. Under reasonable assumptions on the arrival distribution of the best arm, stated in terms of the thinness of the distribution's tail, we prove that the proposed algorithm achieves sub-linear instance-independent regret. We further quantify the explicit dependence of the regret on the parameters of the arrival distribution. We reinforce our theoretical findings with extensive simulation results.
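The setting described above can be illustrated with a toy simulation. The sketch below is a hypothetical illustration, not the paper's actual algorithm: arms arrive over time with front-loaded arrival probability, newly arriving arms are explored only during the first `explore_frac` fraction of the horizon (a made-up parameter for this sketch), and regret is accumulated against the best arm *available* at each round, as in the BL-MAB regret definition.

```python
import random

def simulate_blmab(T=10_000, explore_frac=0.4, seed=0):
    """Toy BL-MAB simulation (illustrative sketch only).

    Arms arrive over time; arms are explored round-robin during the
    first explore_frac * T rounds, after which the empirically best
    explored arm is pulled. Regret is measured per round against the
    best arm currently available, not the best arm overall.
    """
    rng = random.Random(seed)
    means = {}                 # arm id -> true (unknown) mean reward
    counts, sums = {}, {}      # empirical pull counts and reward sums
    regret = 0.0
    for t in range(1, T + 1):
        # A new arm arrives with decaying probability, so arrivals
        # are front-loaded (the regime where sub-linear regret is possible).
        if not means or rng.random() < t ** -0.5:
            arm_id = len(means)
            means[arm_id] = rng.random()
            counts[arm_id], sums[arm_id] = 0, 0.0
        if t <= explore_frac * T:
            # Exploration phase: pull the least-pulled available arm.
            arm = min(counts, key=counts.get)
        else:
            # Exploitation phase: pull the empirically best explored arm.
            arm = max(counts,
                      key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # BL-MAB regret: compare to the best arm available *now*.
        regret += max(means.values()) - means[arm]
    return regret
```

Since per-round regret is at most 1, the returned total always lies in [0, T); how far below T it stays as T grows reflects whether the arrival pattern permits sub-linear regret.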