In this note, we revisit non-stationary linear bandits, a variant of stochastic linear bandits with a time-varying underlying regression parameter. Existing studies develop various algorithms and show that they enjoy an $\widetilde{O}(T^{2/3}(1+P_T)^{1/3})$ dynamic regret, where $T$ is the time horizon and $P_T$ is the path-length that measures the fluctuation of the evolving unknown parameter. However, we discover that a serious technical flaw makes their arguments ungrounded. We revisit the analysis and present a fix. Without modifying the original algorithms, we prove an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret for them, which is slightly worse than the anticipated rate. We also establish some impossibility results for the key quantity involved in the regret analysis. Note that the above dynamic regret guarantee requires oracle knowledge of the path-length $P_T$. By combining with the bandit-over-bandit mechanism, we achieve the same guarantee in a parameter-free manner.
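For concreteness, the dynamic regret and path-length can be written out as follows. This is a sketch using the standard definitions from the non-stationary linear bandit literature; the symbols $\theta_t$ (the time-varying regression parameter), $\mathcal{X}$ (the action set), and $x_t$ (the action chosen at round $t$) are notational assumptions not introduced in the abstract itself:
$$\mathrm{Reg}_T \;=\; \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}} \langle \theta_t, x \rangle \;-\; \langle \theta_t, x_t \rangle \Big), \qquad P_T \;=\; \sum_{t=2}^{T} \big\lVert \theta_t - \theta_{t-1} \big\rVert_2 .$$
Here the dynamic regret compares against the per-round optimal action under the evolving parameter, and the path-length accumulates the total variation of that parameter over the horizon.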