Although Thompson sampling is widely used in stationary environments, it does not effectively account for nonstationarity. To address this limitation, we propose predictive sampling, a policy that balances exploration and exploitation in nonstationary bandit environments. It is equivalent to Thompson sampling when specialized to stationary environments, but much more effective across a range of nonstationary environments because it deprioritizes investment in acquiring information that will quickly lose relevance. To offer insight into the efficacy of predictive sampling, we establish a regret bound. This bound highlights dependence on the rate at which new information arrives to alter the environment. In addition, we conduct experiments on bandit environments with varying rates of information arrival and observe that predictive sampling outperforms Thompson sampling.
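As a point of reference, the following is a minimal sketch of standard Thompson sampling for a stationary Bernoulli bandit — the setting in which predictive sampling reduces to it. The arm means, horizon, and Beta(1, 1) priors here are illustrative assumptions, not parameters from this work.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Standard Thompson sampling on a stationary Bernoulli bandit.

    Each arm gets an independent Beta(1, 1) prior. At every step we
    sample a mean from each arm's posterior, pull the arm with the
    largest sample, and update that arm's posterior with the observed
    binary reward. Returns the cumulative reward over the horizon.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alphas = np.ones(n_arms)  # posterior Beta alpha (successes + 1)
    betas = np.ones(n_arms)   # posterior Beta beta (failures + 1)
    total_reward = 0.0
    for _ in range(horizon):
        samples = rng.beta(alphas, betas)        # one draw per arm
        arm = int(np.argmax(samples))            # greedy w.r.t. samples
        reward = rng.binomial(1, true_means[arm])
        alphas[arm] += reward                    # conjugate update
        betas[arm] += 1 - reward
        total_reward += reward
    return total_reward
```

Because the posteriors concentrate on fixed arm means, this policy eventually commits to the best arm — precisely the behavior that fails under nonstationarity, where information gathered now may soon lose relevance.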