We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting via a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound for the extension using neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.