We study a collaborative multi-agent stochastic linear bandit setting, where $N$ agents forming a network communicate locally to minimize their overall regret. In this setting, each agent has its own linear bandit problem (its own reward parameter), and the goal is to select the best global action with respect to the average of their reward parameters. At each round, each agent proposes an action, and one action is randomly selected and played as the network action. All the agents observe the corresponding rewards of the played action and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents. We propose a distributed upper confidence bound (UCB) algorithm and prove a high-probability bound on its $T$-round regret, in which we account for the linear growth of regret incurred during each communication round. Our regret bound is of order $\mathcal{O}\Big(\sqrt{\frac{T}{N \log(1/|\lambda_2|)}}\cdot (\log T)^2\Big)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix.
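To make the protocol concrete, below is a minimal Python sketch of a single round, under assumptions the abstract does not pin down: a doubly stochastic communication matrix `W`, a finite action set, a shared Gram matrix `V`, a uniformly random choice among the proposed actions, and Chebyshev-style accelerated gossip for the consensus step. All names (`chebyshev_consensus`, `one_round`, `ucb`) are illustrative, not taken from the paper.

```python
import numpy as np

def chebyshev_consensus(W, x, num_iters, lam2):
    """Accelerated (Chebyshev-style) gossip: each entry of the output
    approximates the network-wide average of x. For symmetric doubly
    stochastic W the error contracts roughly like
    (1 - sqrt(1 - lam2**2))**num_iters, which is the source of the
    1/log(1/|lambda_2|) factor in the regret bound."""
    omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))  # momentum weight
    prev, cur = x.copy(), W @ x
    for _ in range(num_iters - 1):
        cur, prev = omega * (W @ cur) - (omega - 1.0) * prev, cur
    return cur

def one_round(W, actions, theta_hats, true_thetas, V, rng,
              beta=1.0, consensus_iters=10, lam2=0.9, noise_sd=0.1):
    """One communication round of the propose-select-play-average loop."""
    N = theta_hats.shape[0]
    V_inv = np.linalg.inv(V)  # shared Gram matrix (illustrative)

    # 1. Each agent proposes the action maximizing its local UCB index.
    def ucb(a, theta):
        return a @ theta + beta * np.sqrt(a @ V_inv @ a)
    proposals = [max(actions, key=lambda a, th=th: ucb(a, th))
                 for th in theta_hats]

    # 2. One proposal is selected (here: uniformly at random, an assumption)
    #    and played as the single network action.
    network_action = proposals[rng.integers(N)]

    # 3. Each agent observes a noisy reward under its own reward parameter.
    rewards = np.array([network_action @ true_thetas[i]
                        + rng.normal(0.0, noise_sd) for i in range(N)])

    # 4. Accelerated consensus gives every agent an estimate of the
    #    average reward across all N agents.
    avg_reward_estimates = chebyshev_consensus(W, rewards, consensus_iters, lam2)
    return network_action, avg_reward_estimates
```

The sketch deliberately omits how $\hat\theta$ and $V$ are updated from the consensus estimates, since the abstract does not specify the update rule.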