We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a class of bandit algorithms that leverages adaptive windowing techniques from the data stream community. We first provide new guarantees on the quality of the estimators produced by adaptive windowing techniques, which are of independent interest to the data mining community. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment, where changes occur instantaneously, and a gradual environment, where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when the abrupt or gradual changes occur in a coordinated manner that we call global changes, and that forced exploration is unnecessary once attention is restricted to such global changes. Unlike existing nonstationary bandit algorithms, ADR-bandit performs optimally in stationary environments as well as in nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform existing approaches in synthetic and real-world environments.
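To make the mechanism concrete, the following is a minimal sketch of the general idea only: a UCB1 index policy whose reward estimates are kept in adaptive windows, with a joint (global) reset of all arms whenever any window reports a change. The names (`AdaptiveWindow`, `ucb_with_global_reset`, `pull_arm`), the confidence parameter `delta`, the minimum window length, and the single midpoint-split test are illustrative assumptions and simplifications; this is not the ADR-bandit algorithm or the adaptive windowing test analyzed in the paper.

```python
import math
from collections import deque


class AdaptiveWindow:
    """Reward buffer with a simplified ADWIN-style mean-shift check.

    Illustrative assumption: a full adaptive windowing test examines every
    split point of the window against a Hoeffding-type threshold; this sketch
    checks only the single midpoint split with confidence parameter `delta`.
    """

    def __init__(self, delta=0.002):
        self.delta = delta
        self.rewards = deque()

    def add(self, reward):
        """Append a reward in [0, 1] and report whether a shift was detected."""
        self.rewards.append(reward)
        return self._change_detected()

    def _change_detected(self):
        n = len(self.rewards)
        if n < 10:  # too few samples for a meaningful comparison
            return False
        data = list(self.rewards)
        old, new = data[: n // 2], data[n // 2:]
        gap = abs(sum(old) / len(old) - sum(new) / len(new))
        # Hoeffding-style threshold on the admissible gap between the halves.
        m = 1.0 / (1.0 / len(old) + 1.0 / len(new))
        eps = math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))
        return gap > eps

    def mean(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def count(self):
        return len(self.rewards)

    def reset(self):
        self.rewards.clear()


def ucb_with_global_reset(pull_arm, n_arms, horizon, delta=0.002):
    """UCB1 over adaptively windowed estimates with a joint reset of all arms
    whenever any arm's window reports a change (the global-change setting)."""
    arms = [AdaptiveWindow(delta) for _ in range(n_arms)]
    for t in range(1, horizon + 1):
        untried = [i for i, a in enumerate(arms) if a.count() == 0]
        if untried:
            choice = untried[0]  # initialize (or re-initialize after a reset)
        else:
            choice = max(
                range(n_arms),
                key=lambda i: arms[i].mean()
                + math.sqrt(2.0 * math.log(t) / arms[i].count()),
            )
        if arms[choice].add(pull_arm(choice)):
            for arm in arms:  # coordinated change detected: drop all history
                arm.reset()
    return [arm.mean() for arm in arms]
```

For example, `ucb_with_global_reset(lambda i: float(random.random() < 0.4 + 0.1 * i), n_arms=3, horizon=5000)` (with `import random`) runs the sketch on three Bernoulli arms; the joint reset mirrors the coordinated, global changes discussed above, whereas a per-arm reset would correspond to changes that affect arms independently.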