We consider an N-player multi-armed bandit game where each player chooses one out of M arms for T turns. Each player has different expected rewards for the arms, and the instantaneous rewards are independent and identically distributed or Markovian. When two or more players choose the same arm, they all receive zero reward. Performance is measured using the expected sum of regrets, compared with an optimal assignment of arms to players that maximizes the sum of expected rewards. We assume that each player only knows her actions and the reward she received each turn. Players cannot observe the actions of other players, and no communication between players is possible. We present a distributed algorithm and prove that it achieves an expected sum of regrets of near-O\left(\log T\right). This is the first algorithm to achieve a near order optimal regret in this fully distributed scenario. All other works have assumed that either all players have the same vector of expected rewards or that communication between players is possible.
翻译:我们考虑的是N型玩家多武装强盗游戏,每个玩家都从M型武器中选择一个来转T轮。每个玩家都有不同的预期武器奖赏,瞬间奖赏是独立的,分布相同,或马尔科维安。当两个或两个以上玩家选择同一个手臂时,他们都得到零奖赏。业绩是用预期的遗憾总和来衡量的,而给玩家分配武器的最佳方式是最大限度地增加预期奖赏的总和。我们假设每个玩家只知道她的行动和得到的奖赏。玩家无法观察其他玩家的行动,也不能在玩家之间进行交流。我们提出了一个分布式算法,并证明它能实现接近O\left(\log T\right)的预期遗憾总和。这是在这种完全分布的情况下实现近顺序最佳遗憾的第一个算法。所有其他工程假设,要么所有玩家都有相同的预期奖赏的载体,要么可以进行玩家之间的交流。