We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However, the players cannot communicate and are penalized (e.g., receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$, where $\Delta$ is the gap between the $m$-th and $(m+1)$-st best arms. Such guarantees were recently achieved in a model that allows the players to communicate implicitly through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto-optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author and enjoys the same strong no-collision property, while our lower bound is based on a topological obstruction and holds even under full information.
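To make the setting concrete, the following is a minimal sketch of one round of the collision model and of the gap $\Delta$. It is illustrative only and is not the algorithm of the paper. Bernoulli rewards, the zero-reward collision penalty (one of the penalties the abstract mentions as an example), and the names `play_round`, `gap`, and `best_expected_reward` are all assumptions introduced here.

```python
import random
from collections import Counter


def play_round(means, choices, rng):
    """Simulate one round: each player pulls the arm in `choices`.

    Assumption: rewards are Bernoulli with the given means, and players
    that collide (pull the same arm) receive zero reward.
    """
    counts = Counter(choices)
    rewards = []
    for arm in choices:
        if counts[arm] > 1:
            # Collision: every player on this arm is penalized.
            rewards.append(0.0)
        else:
            rewards.append(1.0 if rng.random() < means[arm] else 0.0)
    return rewards


def gap(means, m):
    """Gap Delta between the m-th and (m+1)-st best arm means."""
    ordered = sorted(means, reverse=True)
    return ordered[m - 1] - ordered[m]


def best_expected_reward(means, m):
    """Per-round expected total reward when the m players occupy the top m arms."""
    return sum(sorted(means, reverse=True)[:m])


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.9, 0.8, 0.5, 0.2]  # K = 4 arms (hypothetical instance)
    m = 2                         # m = 2 players
    print("gap:", gap(means, m))
    print("collision round:", play_round(means, [0, 0], rng))
    print("collision-free round:", play_round(means, [0, 1], rng))
```

Under these assumptions, the per-round regret of a joint choice is `best_expected_reward(means, m)` minus the expected total reward actually collected, so a collision on a good arm costs the players the full mean of that arm for each colliding player.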