We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of the reward distributions as well as to the population-level distribution of arm-types, and is also precluded from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and the complexity of the problem suggest a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity and the analysis tools needed to study them.
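As a point of reference, the objective can be written in the standard bandit notation; the symbols $\pi_t$, $\mu_{\pi_t}$, and $\mu^\star$ below are our own shorthand for, respectively, the arm played at time $t$, its mean reward, and the largest mean among the $K$ arm-types, and are not introduced in the abstract itself:
\[
  R_n \;=\; n\,\mu^\star \;-\; \mathbb{E}\!\left[\,\sum_{t=1}^{n} \mu_{\pi_t}\right],
\]
so that the instance-dependent guarantee reads $R_n = \mathcal{O}(\log n)$ and the minimax guarantee for $K=2$ reads $R_n = \tilde{\mathcal{O}}(\sqrt{n})$.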