In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box reward function is allowed to vary over time within a reproducing kernel Hilbert space (RKHS). We develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is coping with infinite-dimensional feature maps; we address it by leveraging kernel approximation techniques to prove a sublinear regret bound, the first frequentist sublinear regret guarantee for weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and the standard GP-UCB algorithm. In addition, we establish a novel concentration inequality for weighted Gaussian process regression with general weights, together with universal and weight-dependent upper bounds on the weighted maximum information gain. These results are potentially of independent interest for applications such as news ranking and adaptive pricing, where weights can capture the importance or quality of data. Finally, experiments show that the proposed algorithm often compares favorably to existing methods.
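To make the idea concrete, here is a minimal sketch of weighted kernel ridge regression with a UCB acquisition step. This is an illustrative implementation under simple assumptions, not the paper's exact WGP-UCB algorithm: it uses an RBF kernel, geometric discount weights w_i = gamma^(t-i) (one common choice for non-stationary settings), and a fixed exploration constant beta rather than the theoretically calibrated value.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def weighted_gp_posterior(X, y, Xq, weights, lam=1.0, lengthscale=1.0):
    """Weighted kernel ridge regression: minimizes
        sum_i w_i (f(x_i) - y_i)^2 + lam * ||f||_H^2,
    whose solution has mean k(x)^T (K + lam W^{-1})^{-1} y; the matching
    posterior variance replaces K by the same weighted-regularized matrix."""
    K = rbf_kernel(X, X, lengthscale)
    kq = rbf_kernel(Xq, X, lengthscale)
    A = K + lam * np.diag(1.0 / weights)  # W = diag(weights)
    alpha = np.linalg.solve(A, y)
    mean = kq @ alpha
    # k(x, x) = 1 for the RBF kernel; clip guards tiny negative round-off.
    var = np.clip(1.0 - np.sum(kq * np.linalg.solve(A, kq.T).T, axis=1),
                  0.0, None)
    return mean, var

# Geometric discount weights down-weight stale observations so the
# regression tracks a time-varying reward function.
rng = np.random.default_rng(0)
t = 30
X = rng.uniform(-1, 1, size=(t, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(t)
gamma = 0.9
w = gamma ** np.arange(t - 1, -1, -1)  # most recent point gets weight 1

# UCB rule over a candidate grid: query where mean + beta * std is largest.
Xq = np.linspace(-1, 1, 50)[:, None]
mu, var = weighted_gp_posterior(X, y, Xq, w)
beta = 2.0  # exploration constant (a hand-picked value for illustration)
ucb = mu + beta * np.sqrt(var)
x_next = Xq[np.argmax(ucb)]
```

The weighted regularizer `lam * W^{-1}` is the only change relative to ordinary kernel ridge regression: small weights on old points inflate their effective noise, so the posterior forgets them gracefully.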