The problem of bandits with graph feedback generalizes both the multi-armed bandit (MAB) problem and the problem of learning with expert advice by encoding in a directed graph how the loss vector can be observed in each round of the game. The minimax regret is closely related to the structure of the feedback graph, and this connection is far from fully understood. We propose a new algorithmic framework for the problem based on a partition of the feedback graph. Our analysis reveals the interplay between various parts of the graph by decomposing the regret into the sum of the regret caused by small parts and the regret caused by their interaction. As a result, our algorithm can be viewed as an interpolation and generalization of the optimal algorithms for MAB and for learning with expert advice. Our framework unifies previous algorithms for both strongly observable graphs and weakly observable graphs, resulting in improved and optimal regret bounds on a wide range of graph families, including graphs of bounded degree and strongly observable graphs with a few corrupted arms.
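For concreteness, the following is a minimal sketch of the standard formulation of the setting; the notation ($G$, $V$, $\ell_t$, $i_t$, $R(T)$) is introduced here for illustration and is not taken verbatim from the paper. In each round $t = 1, \dots, T$ the learner selects an arm $i_t$ from the vertex set $V$ of a directed feedback graph $G = (V, E)$, an adversary fixes a loss vector $\ell_t \in [0,1]^V$, and the learner suffers $\ell_t(i_t)$ while observing $\ell_t(j)$ for every arm $j$ with $(i_t, j) \in E$. The minimax regret referred to above is then
\[
R(T) \;=\; \min_{\text{learner}} \; \max_{\text{adversary}} \left( \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(i_t)\right] \;-\; \min_{i \in V} \sum_{t=1}^{T} \ell_t(i) \right).
\]
Under this formulation, the complete graph with all self-loops recovers learning with expert advice (every loss is observed each round), while the graph consisting of self-loops only recovers the classical MAB problem, which is the sense in which an algorithm for graph feedback interpolates between the two.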