While standard bandit algorithms sometimes incur high regret, their performance can be greatly improved by "warm starting" with historical data. Unfortunately, how best to incorporate historical data is unclear: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues, particularly in continuous action spaces. We address these two challenges by proposing Artificial Replay, a meta-algorithm for incorporating historical data into an arbitrary base bandit algorithm. Artificial Replay uses only a subset of the historical data, as needed, to reduce computation and storage. We show that for a broad class of base algorithms satisfying independence of irrelevant data (IIData), a novel property that we introduce, our method matches the regret of a full warm-start approach while potentially using only a fraction of the historical data. We complement these theoretical results with a case study of $K$-armed and continuous combinatorial bandit algorithms, including on a green security domain using real poaching data, to show the practical benefits of Artificial Replay in achieving optimal regret alongside low computational and storage costs.
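To make the meta-algorithm concrete, here is a minimal sketch of the replay-on-demand idea in a $K$-armed Bernoulli setting: whenever the base algorithm proposes an arm for which unused historical samples remain, one of those samples is replayed instead of taking a real action. The UCB1 base learner, the function names, and the data layout below are illustrative assumptions, not the paper's implementation.

```python
import math
import random
from collections import defaultdict


class UCB1:
    """Simple UCB1 base bandit (stand-in for any IIData base algorithm)."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def select_arm(self):
        # Play each arm once before applying the UCB index.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        return max(
            range(self.n_arms),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.sums[arm] += reward


def artificial_replay(base, history, pull, horizon):
    """Run `base` for `horizon` real pulls, replaying historical samples lazily.

    history: dict mapping arm -> list of historical rewards for that arm.
    pull:    callable arm -> reward, performing a real interaction.
    """
    unused = defaultdict(list, {a: list(r) for a, r in history.items()})
    rewards = []
    for _ in range(horizon):
        arm = base.select_arm()
        # Feed unused historical samples for the proposed arm before acting.
        while unused[arm]:
            base.update(arm, unused[arm].pop())
            arm = base.select_arm()
        # No relevant historical data left for this arm: take a real action.
        reward = pull(arm)
        base.update(arm, reward)
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]
    history = {0: [0.0, 0.0, 1.0], 1: [1.0]}  # imbalanced historical coverage
    pull = lambda a: float(random.random() < means[a])
    total = sum(artificial_replay(UCB1(len(means)), history, pull, horizon=1000))
    print(f"cumulative reward: {total:.0f}")
```

Note that historical samples are consumed only when the base algorithm would have chosen the corresponding arm anyway, which is how the method avoids touching spurious or redundantly covered portions of the historical dataset.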