In this paper, we consider a bandit problem in which there are a number of groups, each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., the median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm on them. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.
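To make the two-step structure concrete, the following is a minimal, self-contained Python sketch, not the paper's algorithm. The interface (groups as arm-drawing callables) and the names `two_step_max_quantile` and `make_group` are illustrative assumptions; in particular, Step 2 substitutes a naive uniform-pull quantile comparison for the adaptive finite-arm grouped max-quantile bandit algorithm the paper actually invokes.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_step_max_quantile(groups, alpha, n_arms_per_group, pulls_per_arm):
    """Hypothetical interface: groups[g]() draws a fresh arm from group g's
    reservoir and returns a callable that samples one reward from that arm."""
    # Step 1: request a fixed number of arms from each group's reservoir.
    sampled_arms = [[draw_arm() for _ in range(n_arms_per_group)]
                    for draw_arm in groups]
    # Step 2 (simplified stand-in): pull each arm a fixed number of times,
    # then compare the empirical (1 - alpha)-quantiles of the estimated
    # arm means within each group. The paper's actual second step is an
    # adaptive finite-arm grouped max-quantile bandit algorithm.
    quantile_estimates = []
    for arms in sampled_arms:
        mean_estimates = [np.mean([arm() for _ in range(pulls_per_arm)])
                          for arm in arms]
        quantile_estimates.append(np.quantile(mean_estimates, 1 - alpha))
    return int(np.argmax(quantile_estimates))

def make_group(a, b):
    """Toy reservoir: arm means are Beta(a, b); rewards are Bernoulli."""
    def draw_arm():
        p = rng.beta(a, b)
        return lambda: rng.binomial(1, p)
    return draw_arm

# Two groups; the second group's reservoir has the higher median arm mean
# (Beta(2, 2) has median 0.5, Beta(4, 2) has median roughly 0.686).
groups = [make_group(2, 2), make_group(4, 2)]
best = two_step_max_quantile(groups, alpha=0.5,
                             n_arms_per_group=50, pulls_per_arm=100)
print(best)  # expected to print 1
```

In this toy usage, setting $\alpha = \frac{1}{2}$ means the procedure should select the group whose reservoir has the larger median; the total pull budget here is simply `n_arms_per_group * pulls_per_arm` per group, whereas an adaptive second step would allocate pulls unevenly to resolve only the arms and groups that remain ambiguous.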