We study an important variant of the stochastic multi-armed bandit (MAB) problem that takes penalization into consideration. Instead of directly maximizing the cumulative expected reward, the learner must balance the total reward against a fairness level. In this paper, we present new insights into MAB and formulate the problem within a penalization framework, in which a rigorous notion of penalized regret can be defined and a more refined regret analysis becomes possible. Under this framework, we propose a hard-threshold UCB-like algorithm that enjoys several merits, including asymptotic fairness, nearly optimal regret, and a better tradeoff between reward and fairness. Both gap-dependent and gap-independent regret bounds are established. We provide several remarks to illustrate the soundness of our theoretical analysis. Extensive experimental results corroborate the theory and demonstrate the superiority of our method over existing alternatives.
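The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of one plausible hard-threshold UCB-like rule: whenever some arm's empirical pull fraction falls below a fairness threshold `tau` (an assumed parameter, not from the paper), that arm is pulled; otherwise the classical UCB1 index decides. The function name, the Bernoulli-arm simulation, and the specific thresholding rule are all assumptions for illustration, not the authors' method.

```python
import math
import random

def hard_threshold_ucb(means, horizon, tau=0.05, seed=0):
    """Simulate a hard-threshold UCB-like sketch on Bernoulli arms.

    Fairness step: if any arm's pull count is below tau * t, pull the
    most under-pulled such arm. Reward step: otherwise pull the arm
    with the largest UCB1 index. This is an illustrative rule only.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        # Fairness step: enforce the hard threshold on pull fractions.
        starving = [i for i in range(k) if counts[i] < tau * t]
        if starving:
            arm = min(starving, key=lambda i: counts[i])
        else:
            # Reward step: classical UCB1 index (safe: counts[i] >= 1 here).
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return counts, total_reward
```

Note the tradeoff the threshold controls: a larger `tau` guarantees each arm a larger share of pulls (more fairness) at the cost of more pulls of suboptimal arms (more regret), mirroring the reward/fairness balance described above.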