This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments: it achieves best-of-three-worlds regret bounds, i.e., $O(\sqrt{T \log T})$ for adversarial environments and $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum suboptimality gap, and the total amount of corruption. Polynomial factors in the dimensionality are omitted here. At the lower level, within each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics and thereby performs better. Specifically, it enjoys data-dependent regret bounds that depend on the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss-vector sequence. In addition, for stochastic environments, it also enjoys a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$, where $\sigma^2$ denotes the maximum variance of the feedback loss. The proposed algorithm builds on the SCRiBLe algorithm: by incorporating a new technique that we call scaled-up sampling, we obtain the higher-level adaptability, and by incorporating the technique of optimistic online learning, we obtain the lower-level adaptability.