We propose predictive sampling as an approach to selecting actions that balance exploration and exploitation in nonstationary bandit environments. When specialized to stationary environments, predictive sampling is equivalent to Thompson sampling. However, predictive sampling is effective across a range of nonstationary environments in which Thompson sampling suffers. We establish a general information-theoretic bound on the Bayesian regret of predictive sampling. We then specialize this bound to study a modulated Bernoulli bandit environment. Our analysis highlights a key advantage of predictive sampling over Thompson sampling: predictive sampling deprioritizes investments in exploration where acquired information will quickly become less relevant.
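To make the stationary special case concrete, the following is a minimal sketch of Thompson sampling in a stationary Bernoulli bandit — the setting in which, per the abstract, predictive sampling coincides with Thompson sampling. The arm means, horizon, and Beta(1, 1) priors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])  # hypothetical arm success probabilities
n_arms = len(true_means)

# Beta(alpha, beta) posterior over each arm's mean, starting from a uniform prior.
alpha = np.ones(n_arms)  # 1 + observed successes
beta = np.ones(n_arms)   # 1 + observed failures

T = 5000
pulls = np.zeros(n_arms, dtype=int)
for _ in range(T):
    # Sample a mean for each arm from its posterior, then act
    # greedily with respect to the sampled means.
    samples = rng.beta(alpha, beta)
    a = int(np.argmax(samples))
    reward = int(rng.random() < true_means[a])
    alpha[a] += reward
    beta[a] += 1 - reward
    pulls[a] += 1
```

In a stationary environment the posterior concentrates and play converges to the best arm; the abstract's point is that in nonstationary environments this information can decay in value, which Thompson sampling does not account for and predictive sampling does.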