How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues that are particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into an arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.
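The abstract does not spell out the replay mechanism, so the following is only a minimal sketch of one plausible way such a meta-algorithm could wrap a base bandit: whenever the base algorithm proposes an action for which unused historical samples remain, feed one of those samples to the base algorithm instead of interacting with the environment, and take a real action otherwise. All names here (`EpsilonGreedyBandit`, `artificial_replay`, `pull_arm`) are illustrative assumptions, not the paper's actual pseudocode or API.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Simple K-armed epsilon-greedy base algorithm (illustrative stand-in)."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select_arm(self):
        # Explore uniformly with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.means[a])

    def update(self, arm, reward):
        # Incremental mean update of the reward estimate for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def artificial_replay(base, historical_data, pull_arm, horizon):
    """Sketch of a replay-style wrapper around a base bandit (assumed interface).

    historical_data: list of (arm, reward) pairs collected offline.
    pull_arm: callable arm -> reward for a real (online) interaction.
    Only offline samples for arms the base algorithm actually proposes are
    consumed, so spurious or imbalanced historical data is touched sparingly.
    """
    unused = defaultdict(list)
    for arm, reward in historical_data:
        unused[arm].append(reward)

    online_rewards = []
    t = 0
    while t < horizon:
        arm = base.select_arm()
        if unused[arm]:
            # Replay a stored sample instead of spending a real round.
            base.update(arm, unused[arm].pop())
        else:
            reward = pull_arm(arm)
            base.update(arm, reward)
            online_rewards.append(reward)
            t += 1
    return online_rewards
```

Under this reading, historical samples for arms the base algorithm never selects are simply never loaded into its estimates, which is one way the "fraction of the historical data" claim in the abstract could be realized.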