We consider minimisation of dynamic regret in non-stationary bandits with a slowly varying property. Namely, we assume that arms' rewards are stochastic and independent over time, but that the absolute difference between the expected rewards of any arm at any two consecutive time-steps is at most a drift limit $\delta > 0$. For this setting, which has received little attention in the past, we give a new algorithm that naturally extends the well-known Successive Elimination algorithm to the non-stationary bandit setting. We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits. The analysis in turn relies on a novel characterization of the instance as a detectable gap profile that depends on the expected arm reward differences. We also provide the first minimax regret lower bound for this problem, which enables us to show that our algorithm is essentially minimax optimal. Moreover, this lower bound matches that of the more general total-variation-budgeted bandits problem, establishing that the seemingly easier former problem is, in the minimax sense, at least as hard as the latter. We complement our theoretical results with experimental illustrations.
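To make the setting concrete, the slowly varying assumption and the dynamic regret objective can be written as follows. This is a minimal formalization based only on the description above; the notation $\mu_t(k)$ for the expected reward of arm $k$ at time $t$, $A_t$ for the arm played at time $t$, and $T$ for the horizon is introduced here for illustration and may differ from the paper's own notation.
\[
  |\mu_{t+1}(k) - \mu_t(k)| \le \delta \quad \text{for every arm } k \text{ and every } 1 \le t < T,
  \qquad
  R_T \;=\; \sum_{t=1}^{T} \Big( \max_{k} \mu_t(k) - \mathbb{E}\big[\mu_t(A_t)\big] \Big).
\]
The first condition is the drift limit on each arm's mean reward between consecutive rounds; the second is the dynamic regret, which compares the learner against the per-round best arm rather than a single fixed arm.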