Multi-player multi-armed bandits (MMAB) study how decentralized players cooperatively play the same multi-armed bandit so as to maximize their total cumulative reward. Existing MMAB models mostly assume that when more than one player pulls the same arm, the players either collide and obtain zero reward, or do not collide and gain independent rewards; both assumptions are usually too restrictive in practical scenarios. In this paper, we propose an MMAB with shareable resources as an extension of the collision and non-collision settings. Each shareable arm has a finite amount of shareable resources and a "per-load" reward random variable, both of which are unknown to the players. The reward from a shareable arm equals the "per-load" reward multiplied by the minimum of the number of players pulling the arm and the arm's maximal shareable resources. We consider two types of feedback: sharing demand information (SDI) and sharing demand awareness (SDA), each of which provides different signals of resource sharing. We design the DPE-SDI and SIC-SDA algorithms to address the shareable-arm problem under these two feedback settings respectively, and prove that both algorithms have logarithmic regrets that are tight in the number of rounds. We conduct simulations to validate both algorithms' performance and show their utility in wireless networking and edge computing.
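The reward rule described in the abstract can be sketched as a small simulation. This is a minimal illustrative sketch, not the paper's implementation: the Bernoulli per-load reward and all function and parameter names (`shareable_arm_reward`, `max_capacity`, `per_load_mean`) are assumptions introduced here for clarity.

```python
import random


def shareable_arm_reward(num_players: int, max_capacity: int,
                         per_load_mean: float) -> float:
    """Reward of one shareable arm in a single round.

    Following the model in the abstract, the reward equals the realized
    per-load reward times the effective load, where the effective load is
    min(number of pulling players, the arm's maximal shareable resources).
    A Bernoulli per-load reward is assumed here purely for illustration.
    """
    per_load_reward = 1.0 if random.random() < per_load_mean else 0.0
    effective_load = min(num_players, max_capacity)
    return per_load_reward * effective_load
```

For example, if five players pull an arm whose capacity is three, at most three units of resource are served, so the reward is at most three times the per-load reward; additional players beyond the capacity add nothing.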