We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications, such as dynamic pricing and ads allocation, in a changing environment. We show how the difficulty posed by non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the \emph{variation budget}, and the total time horizon, respectively, our main contributions are the tuned Sliding Window UCB (\texttt{SW-UCB}) algorithm with optimal $\widetilde{O}(d^{2/3}(B_T+1)^{1/3}T^{2/3})$ dynamic regret, and the tuning-free bandit-over-bandit (\texttt{BOB}) framework, built on top of the \texttt{SW-UCB} algorithm, with the best $\widetilde{O}(d^{2/3}(B_T+1)^{1/4}T^{3/4})$ dynamic regret.
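For concreteness, here is a minimal sketch of the two quantities the bounds refer to, assuming the standard non-stationary linear bandit model with rewards $r_t = \langle x_t, \theta_t \rangle + \eta_t$ for a drifting parameter $\theta_t \in \mathbb{R}^d$ and decision set $D_t$ (the symbols $x_t^*$, $\theta_t$, $\eta_t$, and $D_t$ are illustrative notation, not fixed by the abstract): the dynamic regret compares against the per-round optimal action, and the variation budget $B_T$ caps the total drift of $\theta_t$,
\begin{align*}
\text{dynamic regret} &= \sum_{t=1}^{T} \left( \langle x_t^*, \theta_t \rangle - \langle x_t, \theta_t \rangle \right), \qquad x_t^* = \operatorname*{argmax}_{x \in D_t} \langle x, \theta_t \rangle, \\
\sum_{t=1}^{T-1} \left\lVert \theta_{t+1} - \theta_t \right\rVert_2 &\le B_T .
\end{align*}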