In bandits with distribution shifts, one aims to automatically detect an unknown number $L$ of changes in the reward distributions, and to restart exploration when necessary. While this problem remained open for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure guaranteeing an optimal (dynamic) regret of $\sqrt{LT}$, over $T$ rounds, with no knowledge of $L$. However, not all distributional shifts are equally severe: e.g., if no best-arm switches occur, we cannot rule out that a regret of $O(\sqrt{T})$ remains achievable; in other words, is it possible to achieve dynamic regret that scales optimally with only an unknown number of severe shifts? This has unfortunately remained elusive, despite various attempts (Auer et al., 2019; Foster et al., 2020). We resolve this problem in the case of two-armed bandits: we derive an adaptive procedure that guarantees a dynamic regret of order $\tilde{O}(\sqrt{\tilde{L} T})$, where $\tilde L \ll L$ captures an unknown number of severe best-arm changes, i.e., shifts with significant changes in rewards that last sufficiently long to actually require a restart. As a consequence, for any number $L$ of distributional shifts outside of these severe shifts, our procedure achieves regret of just $\tilde{O}(\sqrt{T}) \ll \tilde{O}(\sqrt{LT})$. Finally, we note that our notion of severe shift applies in both classical settings of stochastic switching bandits and of adversarial bandits.
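To make the gap between the two guarantees concrete, consider a hypothetical scenario (the rates below are illustrative assumptions, not results from the paper): suppose $L = \Theta(T^{1/3})$ distributional shifts occur over $T$ rounds, but only $\tilde L = O(1)$ of them are severe. Then
% Illustrative comparison under the assumed rates L = Θ(T^{1/3}) and \tilde L = O(1).
\[
\tilde{O}\big(\sqrt{\tilde{L}\,T}\big) \;=\; \tilde{O}\big(\sqrt{T}\big)
\;\;\ll\;\;
\tilde{O}\big(\sqrt{L\,T}\big) \;=\; \tilde{O}\big(T^{2/3}\big),
\]
so a bound scaling with $\tilde L$ can be polynomially smaller than one scaling with $L$.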