We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, sufficient optimism in the mean reward estimates is achieved by exploiting the variance of the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its exploration is fully data-dependent: it does not rely on any external noise and adapts to different problems without parameter tuning. We derive a $\tilde O(d\sqrt{n\log K})$ gap-free bound on the $n$-round regret of CORe in a stochastic linear bandit, where $d$ is the number of features and $K$ is the number of arms. Extensive empirical evaluation on multiple synthetic and real-world problems demonstrates the effectiveness of CORe.
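To make the idea concrete, the sketch below is a minimal, hypothetical multi-armed illustration of exploration driven only by randomizing an arm's own reward history (a bootstrap-style resample), with no injected noise. The function `bootstrap_bandit`, its arguments, and all implementation details are assumptions for illustration only; this is not the paper's CORe procedure or its linear-bandit form.

```python
import numpy as np


def bootstrap_bandit(reward_fn, K, n, seed=0):
    """Illustrative K-armed bandit whose exploration comes solely from
    resampling its own observed rewards (a hypothetical sketch, not CORe)."""
    rng = np.random.default_rng(seed)
    history = [[] for _ in range(K)]  # observed rewards per arm
    for _ in range(n):
        scores = np.empty(K)
        for a in range(K):
            if not history[a]:
                scores[a] = np.inf  # pull each arm at least once
            else:
                # Perturbed mean: average of a with-replacement resample of the
                # arm's rewards; its spread reflects the data's own variance
                # and shrinks as more observations accumulate.
                sample = rng.choice(history[a], size=len(history[a]), replace=True)
                scores[a] = sample.mean()
        arm = int(np.argmax(scores))
        history[arm].append(reward_fn(arm))
    return history


# Example usage on a toy Bernoulli problem with three arms.
if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]
    hist = bootstrap_bandit(lambda a: np.random.binomial(1, means[a]), K=3, n=1000)
    print([len(h) for h in hist])  # the best arm should be pulled most often
```

The design point this sketch mirrors is that the magnitude of exploration is set entirely by the variability already present in the collected rewards, so no exploration noise scale needs to be tuned.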