We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from deploying these algorithms in many applications where $M$ is poly$(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$, as long as certain conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.
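To see concretely why the improvement from polynomial to logarithmic dependence on $M$ is what enables the switching-regret application, consider the following back-of-the-envelope instantiation. It is a sketch based only on the bounds stated above; the specific form $\widetilde{O}(\sqrt{T}\log M)$ assumed for the improved overhead is an illustrative assumption, as the abstract specifies only the logarithmic dependence on $M$. With one base algorithm per possible start time, we have $M = T$, and
\[
\widetilde{O}\big(\sqrt{MT}\big)\Big|_{M=T} \;=\; \widetilde{O}(T)
\qquad\text{vs.}\qquad
\widetilde{O}\big(\sqrt{T}\,\log M\big)\Big|_{M=T} \;=\; \widetilde{O}\big(\sqrt{T}\big).
\]
That is, a CORRAL-style master pays a vacuous linear-in-$T$ overhead in this regime, whereas an overhead that is only logarithmic in $M$ stays of the same $\sqrt{T}$ order as the target $\widetilde{O}(\sqrt{d S T})$ switching regret.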