This paper studies a cooperative multi-agent stochastic multi-armed bandit problem in which agents operate asynchronously -- their pull times and rates are unknown, irregular, and heterogeneous -- while facing the same instance of a K-armed bandit. Agents can share reward information to speed up learning, at the cost of additional communication. We propose ODC, an on-demand communication protocol that tailors the communication between each pair of agents to their empirical pull times. ODC is efficient when agents' pull times are highly heterogeneous, and its communication complexity depends on those empirical pull times. ODC is a generic protocol that can be integrated into most cooperative bandit algorithms without degrading their performance. We then incorporate ODC into natural extensions of the UCB and AAE algorithms, yielding two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.
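The on-demand idea in the abstract -- each agent buffers its observations per peer and communicates only when warranted by empirical pull counts -- can be illustrated with a minimal sketch. All names, the buffering scheme, and the threshold rule below are our own illustrative choices, not the paper's exact ODC protocol:

```python
class ODCAgent:
    """Illustrative agent with per-peer on-demand communication.

    Rewards observed since the last sync with a peer are buffered and
    sent only when the buffer is large relative to that peer's
    empirical pull count, so pairs with very different pull rates
    exchange few messages (hypothetical rule, not the paper's ODC).
    """

    def __init__(self, agent_id, n_arms, peer_ids):
        self.agent_id = agent_id
        self.buffers = {p: [] for p in peer_ids}  # peer -> [(arm, reward)]
        self.pull_counts = [0] * n_arms
        self.reward_sums = [0.0] * n_arms

    def pull(self, arm, reward):
        # Record a local observation and buffer it for every peer.
        self.pull_counts[arm] += 1
        self.reward_sums[arm] += reward
        for buf in self.buffers.values():
            buf.append((arm, reward))

    def should_send(self, peer_id, peer_pull_count):
        # Communicate on demand: only when the buffered observations
        # are a constant fraction of the peer's empirical pull count,
        # so slowly-pulling peers trigger few messages.
        return len(self.buffers[peer_id]) >= max(1, peer_pull_count // 2)

    def send(self, peer):
        # Flush the buffer for this peer; the peer folds the shared
        # observations into its own arm statistics.
        msgs = self.buffers[peer.agent_id]
        for arm, reward in msgs:
            peer.pull_counts[arm] += 1
            peer.reward_sums[arm] += reward
        self.buffers[peer.agent_id] = []
        return len(msgs)
```

A fast agent paired with a slow one would accumulate a large buffer before the threshold fires, which is the sense in which communication adapts to heterogeneous pull rates.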