This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of rested and restless bandits in which the arms' expected payoffs are monotonically non-decreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing regret bounds that depend on the properties of the instance and, under certain circumstances, are of order $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. Finally, we empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset, illustrating the effectiveness of the proposed approaches.