In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to train novice workers of unknown quality in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward across the $K$ arms. This corresponds to the objective of training a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and an instance-independent regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion; this policy adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). Our numerical experiments demonstrate the efficacy of this policy compared to several natural alternatives in practical parameter regimes.
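The abstract does not spell out the exact confidence-bound tuning or stopping rule, so the following Python sketch only illustrates the general explore-then-commit template it describes, under assumed Bernoulli rewards, a round-robin exploration phase, a hypothetical Hoeffding-style confidence radius, and a hypothetical stopping rule (commit once one arm's lower confidence bound dominates every other arm's upper confidence bound). The function `explore_then_commit_max` and its parameters are illustrative, not the paper's algorithm.

```python
import numpy as np

def explore_then_commit_max(means, T, delta=0.05, rng=None):
    """Illustrative explore-then-commit sketch for the max-cumulative-reward
    objective: explore with confidence bounds, stop adaptively, then commit
    all remaining pulls to one arm. `means` are the (unknown to the policy)
    Bernoulli arm means, used here only to simulate rewards."""
    rng = np.random.default_rng(rng)
    K = len(means)
    pulls = np.zeros(K, dtype=int)   # number of times each arm was pulled
    rewards = np.zeros(K)            # cumulative reward earned on each arm
    committed = None                 # index of the arm committed to, if any

    for t in range(T):
        if committed is None:
            arm = int(np.argmin(pulls))   # explore: pull the least-pulled arm
        else:
            arm = committed               # commit: keep pulling the chosen arm
        reward = rng.binomial(1, means[arm])
        pulls[arm] += 1
        rewards[arm] += reward

        if committed is None and pulls.min() >= 1:
            # Hoeffding-style confidence radius; the width is an assumed tuning.
            mean_hat = rewards / pulls
            radius = np.sqrt(np.log(2 * K * T / delta) / (2 * pulls))
            lcb, ucb = mean_hat - radius, mean_hat + radius
            best = int(np.argmax(lcb))
            # Adaptive stop: commit once `best` dominates every other arm.
            if all(lcb[best] >= ucb[a] for a in range(K) if a != best):
                committed = best

    # Objective value: the highest cumulative reward across the K arms.
    return rewards.max(), committed
```

For example, `explore_then_commit_max([0.3, 0.5, 0.7], T=10_000, rng=0)` simulates one run and returns the highest cumulative reward together with the arm committed to; the point of the sketch is that all remaining pulls after the adaptive stop accrue to a single arm, which is what the max-reward objective rewards.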