We study the problem of stochastic bandits with adversarial corruptions in the cooperative multi-agent setting, where $V$ agents interact with a common $K$-armed bandit problem and each pair of agents can communicate to expedite learning. In this problem, rewards are sampled independently across all agents and rounds from the arms' distributions, but they may be corrupted by an adversary. Our goal is to minimize both the overall regret and the communication cost across all agents. We first show that an additive term depending on the corruption level is unavoidable for any algorithm in this problem. We then propose a new algorithm that is agnostic to the corruption level. Our algorithm not only achieves near-optimal regret in the stochastic setting, but also attains a regret bound with only an additive corruption term in the corrupted setting, while maintaining efficient communication. The algorithm also applies to the single-agent corruption problem, where it achieves a high-probability regret bound that removes the multiplicative dependence of the corruption level on $K$. Our result for the single-agent case resolves an open question of Gupta et al. [2019].