We study multiplayer stochastic multi-armed bandit problems in which the players cannot communicate and, if two or more players pull the same arm, a collision occurs and the colliding players receive zero reward. We consider two feedback models: a model in which the players can observe whether a collision has occurred, and a more difficult setup in which no collision information is available. We give the first theoretical guarantees for the second model: an algorithm with logarithmic regret, and an algorithm with square-root-type regret that does not depend on the gaps between the means. For the first model, we give the first square-root regret bounds that do not depend on the gaps. Building on these ideas, we also give an algorithm for reaching approximate Nash equilibria quickly in stochastic anti-coordination games.
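To make the model concrete, here is a minimal simulation sketch (not part of the paper) of the collision mechanism and the two feedback models described above. All names and parameters are illustrative assumptions: arms are taken to be Bernoulli, and the `collision_info` flag distinguishes the setting where collisions are observed from the one where a zero reward is indistinguishable from a genuine zero draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(means, chosen_arms, collision_info=True):
    """Simulate one round of the multiplayer bandit model (illustrative sketch).

    means        : array of shape (K,), Bernoulli means of the K arms.
    chosen_arms  : array of shape (M,), arm index pulled by each of M players.
    Returns per-player rewards, plus collision indicators if collision_info is True.
    """
    draws = rng.binomial(1, means[chosen_arms])               # outcome of each pulled arm
    counts = np.bincount(chosen_arms, minlength=len(means))   # players per arm
    collided = counts[chosen_arms] >= 2                       # True if >= 2 players share the arm
    rewards = np.where(collided, 0, draws)                    # colliding players receive zero reward
    if collision_info:
        return rewards, collided    # first feedback model: collisions are observed
    return rewards, None            # second model: only the (possibly zeroed) reward is seen

# Example: 3 players on 5 arms; players 0 and 1 collide on arm 2.
means = np.array([0.9, 0.8, 0.7, 0.5, 0.3])
print(play_round(means, np.array([2, 2, 0]), collision_info=False))
```

In the no-collision-information case the second return value is absent, which is exactly what makes that setting harder: a player cannot tell whether a zero reward came from the arm or from another player.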