We consider a decentralized multiplayer game, played over $T$ rounds, with a leader-follower hierarchy described by a directed acyclic graph. In each round, the graph structure dictates the order in which the players act and which players observe one another's actions. At the end of each round, all players receive a joint bandit reward based on their joint action, which they use to update their strategies toward the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves joint pseudo-regret sublinear in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred by the decentralized nature of our problem relative to the centralized setting.
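For concreteness, one standard way to formalize the joint pseudo-regret under this setup (the notation below is ours and not necessarily the paper's) is
\[
\bar{R}(T) \;=\; \max_{a \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_m} \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(a)\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(a_t)\right],
\]
where $m$ is the number of players, $\mathcal{A}_i$ is player $i$'s action set, $a_t$ is the joint action selected at round $t$, $r_t$ is the joint bandit reward, and the expectation is taken over the players' randomized strategies (and the rewards, in the stochastic case). Sublinear joint pseudo-regret then means $\bar{R}(T)/T \to 0$ as $T \to \infty$.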