We study the problem of information sharing and cooperation in Multi-Player Multi-Armed bandits. We propose the first algorithm that achieves logarithmic regret for this problem. Our results are based on two innovations. First, we show that a simple modification to a successive elimination strategy can be used to allow the players to estimate their suboptimality gaps, up to constant factors, in the absence of collisions. Second, we leverage the first result to design a communication protocol that successfully uses the small reward of collisions to coordinate among players, while preserving meaningful instance-dependent logarithmic regret guarantees.
翻译:我们研究了多层多武装匪徒的信息共享与合作问题,我们提出了第一个实现对数遗憾的算法。我们的结果基于两个创新。首先,我们表明,对连续消除战略的简单修改可以用来让玩家在没有碰撞的情况下估计其不最优化差距,直至经常因素。第二,我们利用第一个结果设计一个通信协议,成功地利用小型碰撞奖励来协调玩家之间的协调,同时保留有意义的依赖实例的对数遗憾保证。