We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protocols for this setting, inspired by opinion dynamics that are well-studied for other algorithmic tasks in the GOSSIP model. For stationary reward settings, we prove for the first time that these simple protocols exhibit best-of-both-worlds behavior, simultaneously obtaining constant cumulative regret scaling like $R(T)/T = \widetilde O(1/T)$, and also reaching consensus on the highest-mean action within $\widetilde O(\sqrt{n})$ rounds. We obtain these results by showing a new connection between the global evolution of these decentralized protocols and a class of zero-sum multiplicative weights update} processes. Using this connection, we establish a general framework for analyzing the population-level regret and other properties of our protocols. Finally, we show our protocols are also surprisingly robust to adversarial rewards, and in this regime we obtain sublinear regret scaling like $R(T)/T = \widetilde O(1/\sqrt{T})$ as long as the number of rounds does not grow too fast as a function of $n$.
翻译:暂无翻译