We consider a large number of agents collaborating on a multi-armed bandit problem with a large number of arms. The goal is to minimise the regret of each agent in a communication-constrained setting. We present a decentralised algorithm which builds upon and improves the Gossip-Insert-Eliminate method of Chawla et al. arxiv:2001.05452. We provide a theoretical analysis of the regret incurred which shows that our algorithm is asymptotically optimal. In fact, our regret guarantee matches the asymptotically optimal rate achievable in the full communication setting. Finally, we present empirical results which support our conclusions
翻译:我们认为有大量代理人在多武装匪徒问题上合作,拥有大量武器,目的是在通信受限制的环境中最大限度地减少每个代理人的遗憾,我们提出了一个分散的算法,该算法以Chawla等人的“Gossip-Insert-Elicate”法为基础,并改进了该方法,即Chawla et al.arxiv:2001.05452. 我们从理论上分析了所发生的遗憾,它表明我们的算法是非现成最佳的。事实上,我们的遗憾保证与在全面通信环境中可以实现的无现成最佳率相匹配。最后,我们介绍了支持我们结论的经验结果。