Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We show that such failures arise because, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. We establish a theoretical guarantee on the performance of predictive sampling through a Bayesian regret bound. We also provide versions of predictive sampling whose computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
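For illustration, the following is a minimal sketch, not the paper's implementation, of Thompson sampling on a hypothetical non-stationary Gaussian bandit whose arm means follow AR(1) dynamics; the per-arm drift rates below are assumptions chosen only to show how information about a fast-changing arm loses usefulness quickly, the behavior that predictive sampling is designed to account for.

```python
# Sketch: Thompson sampling on a two-armed Gaussian bandit with AR(1) arm
# means theta_{t+1} = gamma * theta_t + noise (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
T = 2000
gammas = np.array([0.999, 0.9])     # arm 0 drifts slowly, arm 1 drifts fast
obs_sd = 1.0                        # reward noise std
proc_sd = np.sqrt(1.0 - gammas**2)  # keeps stationary variance of theta at 1

theta = rng.normal(size=2)          # latent arm means
# Per-arm Gaussian posterior over theta, tracked with Kalman-filter updates.
post_mean = np.zeros(2)
post_var = np.ones(2)

regret = 0.0
for t in range(T):
    # Thompson sampling: sample a mean for each arm from its posterior,
    # then play the arm with the largest sample.
    samples = rng.normal(post_mean, np.sqrt(post_var))
    a = int(np.argmax(samples))

    reward = theta[a] + obs_sd * rng.normal()
    regret += theta.max() - theta[a]

    # Measurement update for the played arm.
    k = post_var[a] / (post_var[a] + obs_sd**2)
    post_mean[a] += k * (reward - post_mean[a])
    post_var[a] *= 1.0 - k

    # Time update: the environment drifts, so every posterior widens again;
    # what was learned about the fast-drifting arm decays quickly.
    theta = gammas * theta + proc_sd * rng.normal(size=2)
    post_mean *= gammas
    post_var = gammas**2 * post_var + proc_sd**2

print(f"cumulative regret over {T} steps: {regret:.1f}")
```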