Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to nonstationary environments. We show that such failures are attributable to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to nonstationarity. Building on this insight, we propose predictive sampling, which extends Thompson sampling to account for the rate at which acquired information loses its usefulness. We establish a Bayesian regret bound and show that, in nonstationary bandit environments, the regret incurred by Thompson sampling can far exceed that of predictive sampling. We also present implementations of predictive sampling that scale, in a computationally tractable manner, to complex bandit environments of practical interest. Through simulations, we demonstrate that predictive sampling outperforms Thompson sampling and other state-of-the-art algorithms across a wide range of nonstationary bandit environments.
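To make the failure mode concrete, the following is a minimal illustrative sketch of vanilla Thompson sampling run on a toy nonstationary two-armed Bernoulli bandit whose success probabilities drift over time. The environment (AR(1) drift in logit space) and all parameter values are assumptions chosen for illustration; this is not the paper's predictive sampling algorithm, only the baseline whose limitation the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative nonstationary two-armed Bernoulli bandit: each arm's success
# probability follows an AR(1) process in logit space around 0.5.
# (These dynamics and parameters are assumptions for illustration only.)
T = 2000
K = 2
phi = 0.99          # AR(1) persistence; smaller phi => faster nonstationarity
sigma = 0.03        # scale of the drift noise
logits = np.zeros(K)

# Vanilla Thompson sampling with Beta(1, 1) priors. The posterior counts only
# ever grow, so evidence gathered long ago is weighted the same as fresh
# evidence -- the behavior the abstract identifies as problematic when
# acquired information quickly loses its usefulness.
alpha = np.ones(K)
beta = np.ones(K)

total_reward = 0.0
for t in range(T):
    # Environment drift.
    logits = phi * logits + sigma * rng.standard_normal(K)
    probs = 1.0 / (1.0 + np.exp(-logits))

    # Thompson sampling: sample a mean estimate per arm, play the argmax.
    samples = rng.beta(alpha, beta)
    a = int(np.argmax(samples))
    r = float(rng.random() < probs[a])
    alpha[a] += r
    beta[a] += 1.0 - r
    total_reward += r

print(f"average reward of vanilla Thompson sampling: {total_reward / T:.3f}")
```

In this sketch, the agent keeps accumulating evidence as if the environment were stationary; an algorithm that accounts for how quickly that evidence becomes stale, as predictive sampling is designed to do, would explore differently in the same setting.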