拥有许多手臂的无休无止的强盗:击败中央限制理论 (Restless Bandits with Many Arms: Beating the Central Limit Theorem)

We consider finite-horizon restless bandits with multiple pulls per period, which play an important role in recommender systems, active learning, revenue management, and many other areas. While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$. Thus, there is substantial value in understanding the performance of index policies and other policies that can be computed efficiently for large $N$. We study the growth of the optimality gap, i.e., the loss in expected performance compared to an optimal policy, for such policies in a classical asymptotic regime proposed by Whittle in which $N$ grows while holding constant the fraction of arms that can be pulled per period. Intuition from the Central Limit Theorem and the tightest previous theoretical bounds suggest that this optimality gap should grow like $O(\sqrt{N})$. Surprisingly, we show that it is possible to outperform this bound. We characterize a non-degeneracy condition and a wide class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$. These include most widely-used index policies. When this non-degeneracy condition does not hold, we show that fluid-priority policies nevertheless have an optimality gap that is $O(\sqrt{N})$, significantly generalizing the class of policies for which convergence rates are known. We demonstrate that fluid-priority policies offer state-of-the-art performance on a collection of restless bandit problems in numerical experiments.

翻译：我们考虑的是每个时期有多重拉力的不固定的固定强盗,这些强盗在建议系统、积极学习、收入管理和其他许多领域起着重要作用。虽然原则上可以使用动态编程来计算最佳政策,但原则上可以使用动态编程来计算最佳政策。因此,在理解指数政策和其他政策的业绩方面有很大的价值,可以有效地计算大美元。我们研究的是最佳性差距的增长,即与最佳政策相比,预期性能的损失,因为在惠特尔提出的典型的零效制度下,这种政策在典型的零效政策中起着重要作用,美元在不断增长,同时保持每一时期可以拉动的武器比例。从中央限值理论和以往最狭窄的理论界限中推断出,这种最佳性差应该像美元(sqrt{N}美元)一样增长。令人惊讶的是,我们发现,这种最佳性差有可能超过这一约束。我们把非变性状况和一系列新的零效的零效政策定性政策归为零效政策,称为流动性优先度政策,但这种最优性的政策却显示这种不具有最佳性弹性政策。