We study the problem of corralling stochastic bandit algorithms, that is combining multiple bandit algorithms designed for a stochastic environment, with the goal of devising a corralling algorithm that performs almost as well as the best base algorithm. We give two general algorithms for this setting, which we show benefit from favorable regret guarantees. We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward, and depends on the gap between the highest reward and other rewards.
翻译:我们研究的是串通强盗算法的问题,即将设计用于随机环境的多个强盗算法结合起来,目的是设计一种几乎和最佳基本算法一样的串通算法。我们为这一环境提供了两种一般算法,我们从有利于悔恨的保证中得到好处。我们证明,这种串通算法的遗憾并不比以最高奖赏包含手臂的最佳算法更糟糕,并且取决于最高奖赏与其他奖赏之间的差距。