We consider a linear stochastic bandit problem involving $M$ agents that can collaborate via a central server to minimize regret. A fraction $\alpha$ of these agents are adversarial and can act arbitrarily, leading to the following tension: while collaboration can potentially reduce regret, it can also disrupt the process of learning due to adversaries. In this work, we provide a fundamental understanding of this tension by designing new algorithms that balance the exploration-exploitation trade-off via carefully constructed robust confidence intervals. We also complement our algorithms with tight analyses. First, we develop a robust collaborative phased elimination algorithm that achieves $\tilde{O}\left(\left(\alpha + 1/\sqrt{M}\right)\sqrt{dT}\right)$ regret for each good agent; here, $d$ is the model dimension and $T$ is the horizon. For small $\alpha$, our result thus reveals a clear benefit of collaboration despite adversaries. Using an information-theoretic argument, we then prove a matching lower bound, thereby providing the first set of tight, near-optimal regret bounds for collaborative linear bandits with adversaries. Furthermore, by leveraging recent advances in high-dimensional robust statistics, we significantly extend our algorithmic ideas and results to (i) the generalized linear bandit model that allows for non-linear observation maps; and (ii) the contextual bandit setting that allows for time-varying feature vectors.
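To make the collaborative phased-elimination idea concrete, below is a minimal simulation sketch, not the authors' exact algorithm: each good agent reports a local least-squares estimate of the unknown parameter, the server aggregates the $M$ reports with a coordinate-wise median (one simple robust aggregator that tolerates a small corrupted fraction $\alpha$), and arms whose robust upper confidence bound falls below the best lower bound are eliminated each phase. The confidence width here is an illustrative placeholder chosen to mirror the $(\alpha + 1/\sqrt{M})$ scaling; all names and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, M, alpha = 5, 20, 0.1          # dimension, number of agents, corruption level
arms = rng.normal(size=(30, d))   # fixed action set of 30 feature vectors
theta_star = rng.normal(size=d)   # unknown parameter (known only to the simulator)

def local_estimate(n_pulls, active_arms):
    """One good agent's least-squares estimate from n_pulls noisy plays (simulated)."""
    X = active_arms[rng.integers(len(active_arms), size=n_pulls)]
    y = X @ theta_star + rng.normal(scale=0.5, size=n_pulls)
    # Small ridge term keeps the solve well-posed when n_pulls < d.
    return np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ y)

active = arms
for phase in range(1, 6):
    n_pulls = 2 ** phase  # per-agent exploration budget doubles each phase
    # Each agent sends an estimate to the server; some reports are adversarial.
    estimates = np.stack([local_estimate(n_pulls, active) for _ in range(M)])
    n_bad = int(alpha * M)
    estimates[:n_bad] = rng.normal(scale=10.0, size=(n_bad, d))  # arbitrary corruptions
    # Coordinate-wise median: unaffected by any minority of outlier reports.
    theta_hat = np.median(estimates, axis=0)
    # Illustrative robust width mirroring the (alpha + 1/sqrt(M)) scaling.
    width = (alpha + 1.0 / np.sqrt(M)) * np.sqrt(d / n_pulls)
    rewards = active @ theta_hat
    # Keep only arms whose upper bound reaches the best arm's lower bound.
    active = active[rewards + width >= rewards.max() - width]
    print(f"phase {phase}: {len(active)} arms remain")
```

Running the sketch shows the active set shrinking toward near-optimal arms even with $\alpha M$ agents reporting garbage; swapping the median for a naive average breaks this, which is the tension the abstract describes.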