We study the problem of information sharing and cooperation in Multi-Player Multi-Armed bandits. We propose the first algorithm that achieves logarithmic regret for this problem when the collision reward is unknown. Our results are based on two innovations. First, we show that a simple modification to a successive elimination strategy can be used to allow the players to estimate their suboptimality gaps, up to constant factors, in the absence of collisions. Second, we leverage the first result to design a communication protocol that successfully uses the small reward of collisions to coordinate among players, while preserving meaningful instance-dependent logarithmic regret guarantees.
翻译:我们研究了多层多武装匪徒的信息共享与合作问题,我们提出了第一个在碰撞奖励未知时实现对数遗憾的算法。我们的结果基于两个创新。首先,我们表明,可以对连续消除战略进行简单的修改,以便让玩家在没有碰撞的情况下估计其不优化差距,直至经常因素。第二,我们利用第一个结果设计一个通信协议,成功地利用碰撞的微小奖励来协调玩家,同时保留有意义的依赖实例的对数遗憾保证。