We study the Improving Multi-Armed Bandit (IMAB) problem, in which the reward obtained from an arm increases with the number of pulls it receives. This model provides an elegant abstraction for many real-world problems in domains such as education and employment, where decisions about the distribution of opportunities can affect the future capabilities of communities and the disparity between them. A decision-maker in such settings must consider the impact of her decisions on future rewards in addition to the standard objective of maximizing her cumulative reward at any time. In many of these applications, the time horizon is unknown to the decision-maker beforehand, which motivates the study of the IMAB problem in the technically more challenging horizon-unaware setting. We study the tension that arises between two seemingly conflicting objectives in the horizon-unaware setting: a) maximizing the cumulative reward at any time based on the current rewards of the arms, and b) ensuring that arms with better long-term rewards get sufficient opportunities even if they initially have low rewards. We show that, surprisingly, the two objectives are aligned with each other in this setting. Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward while ensuring that the arms reach their true potential given sufficient time. Our algorithm mitigates the initial disparity due to lack of opportunity and continues pulling an arm until it stops improving. We prove the optimality of our algorithm by showing that a) any algorithm for the IMAB problem, no matter how utilitarian, must suffer $\Omega(T)$ policy regret and $\Omega(k)$ competitive ratio with respect to the optimal offline policy, and b) the competitive ratio of our algorithm is $O(k)$.
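To make the tension between the two objectives concrete, here is a minimal toy simulation of the IMAB setting. The saturating reward curves and the two baseline policies (myopic greedy and round-robin) are illustrative assumptions of this sketch, not the paper's algorithm: the point is only that a policy which maximizes current reward can permanently starve the arm with the better long-term reward.

```python
import math

def make_arm(ceiling, rate):
    """Reward after n prior pulls: non-decreasing, saturating at `ceiling`.
    (Illustrative functional form; the IMAB model only requires that
    rewards improve with the number of pulls.)"""
    return lambda n: ceiling * (1.0 - math.exp(-rate * (n + 1)))

def run_policy(arms, horizon, choose):
    """Run `choose` for `horizon` rounds; return (cumulative reward, pull counts)."""
    pulls = [0] * len(arms)
    total = 0.0
    for t in range(horizon):
        i = choose(t, pulls, arms)
        total += arms[i](pulls[i])
        pulls[i] += 1
    return total, pulls

def greedy(t, pulls, arms):
    """Myopic policy: always pull the arm with the highest *current* reward."""
    return max(range(len(arms)), key=lambda i: arms[i](pulls[i]))

def round_robin(t, pulls, arms):
    """Naive equal-opportunity policy: cycle through the arms."""
    return t % len(arms)

# Arm 0 improves slowly but has the higher long-term reward;
# arm 1 looks better early on but plateaus quickly.
arms = [make_arm(ceiling=1.0, rate=0.01),
        make_arm(ceiling=0.3, rate=1.0)]

g_total, g_pulls = run_policy(arms, 1000, greedy)
rr_total, rr_pulls = run_policy(arms, 1000, round_robin)
# Greedy locks onto the early-reward arm and never gives arm 0 an opportunity,
# so even uniform exploration eventually earns more cumulative reward.
```

In this toy instance the greedy policy never pulls arm 0 at all, illustrating the "initial disparity due to lack of opportunity" that the abstract's algorithm is designed to mitigate, while even plain round-robin overtakes it in cumulative reward over a long enough run.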