We consider online learning with feedback graphs, a sequential decision-making framework where the learner's feedback is determined by a directed graph over the action set. We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments. The bound against oblivious adversaries is $\tilde{O} (\sqrt{\alpha T})$, where $T$ is the time horizon and $\alpha$ is the independence number of the feedback graph. The bound against stochastic environments is $O\big( (\ln T)^2 \max_{S\in \mathcal I(G)} \sum_{i \in S} \Delta_i^{-1}\big)$, where $\mathcal I(G)$ is the family of all independent sets in a suitably defined undirected version of the graph and $\Delta_i$ are the suboptimality gaps. The algorithm combines ideas from the EXP3++ algorithm for stochastic and adversarial bandits and the EXP3.G algorithm for feedback graphs with a novel exploration scheme. The scheme, which exploits the structure of the graph to reduce exploration, is key to obtaining best-of-both-worlds guarantees with feedback graphs. We also extend our algorithm and results to a setting where the feedback graphs are allowed to change over time.