We consider a stochastic bandit problem with countably many arms that belong to a finite set of types, each characterized by a unique mean reward. In addition, there is a fixed distribution over types which determines the proportion of each type in the population of arms. The decision maker is oblivious to the type of any arm and to the aforementioned distribution over types, but knows exactly the total number of types occurring in the population of arms. We propose a fully adaptive online learning algorithm that achieves O(log n) distribution-dependent expected cumulative regret after any number of plays n, and show that this order of regret is best possible. The analysis of our algorithm relies on newly discovered concentration and convergence properties of optimism-based policies like UCB in finite-armed bandit problems with "zero gap," which may be of independent interest.
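As a point of reference for the optimism-based policies mentioned above, the classical finite-armed UCB1 index can be sketched as follows. This is a minimal illustration of the standard UCB rule, not the paper's adaptive algorithm for countably many arms; the arm means, horizon, and Bernoulli reward model are illustrative assumptions.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run the classical UCB1 policy on a finite-armed Bernoulli bandit.

    `means` holds the success probabilities (unknown to the learner).
    Returns the number of times each arm was pulled over `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k       # pull counts per arm
    totals = [0.0] * k    # cumulative reward per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialize by playing each arm once
        else:
            # optimistic index: empirical mean plus exploration bonus
            arm = max(
                range(k),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls

pulls = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

Because the exploration bonus shrinks like sqrt(log t / pulls), suboptimal arms are sampled only O(log n) times, which is the source of the logarithmic regret rate; the "zero gap" regime studied in the paper is precisely where this standard per-arm argument breaks down and finer concentration properties are needed.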