We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets
翻译:我们研究的是与多个代理人通过通信网络合作的多个代理人的非随机联合多武装盗匪的新问题。 武器损失是由一个模糊的对手分配的, 指向每个代理人的每条手臂的丢失, 我们称之为“ 粗略的对立 ” 。 在这种背景下, 不同的代理人可以在同一步中选择同一个手臂, 但观察不同的反馈。 每个代理人的目标是在后视中找到一个全球最好的手臂, 其累积损失平均为所有代理人的最小, 这需要代理人之间的沟通。 我们为在不同的环境下的任何联合会的土匪算法提供了较低的下限, 当代理人能够获得完整的信息反馈或强盗反馈时。 对于土匪反馈设置, 我们提议一个近于最佳的联合会式土匪算法, 称为FEDEXP3. 我们的算法对Cesa- Bianchi et al. (2016) 提出的一个公开问题给出了肯定的答案: FEDEXP3 可以保证子线级遗憾, 而不必在不同的代理人之间交换选定的武器身份序列或损失序列的分级。 我们还提供我们对真实代理人之间模拟数据结果的理论和数字评估。