Adapting to Delays and Data in Adversarial Multi-Armed Bandits

We consider the adversarial multi-armed bandit problem under delayed feedback. We analyze variants of the Exp3 algorithm that tune their step size using only information (about the losses and delays) available at the time of the decisions, and obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses. First, through a remarkably simple proof technique, we show that with proper tuning of the step size, the algorithm achieves an optimal (up to logarithmic factors) regret of order $\sqrt{\log(K)(TK + D)}$ both in expectation and in high probability, where $K$ is the number of arms, $T$ is the time horizon, and $D$ is the cumulative delay. The high-probability version of the bound, which is the first high-probability delay-adaptive bound in the literature, crucially depends on the use of implicit exploration in estimating the losses. Then, following Zimmert and Seldin [2019], we extend these results so that the algorithm can "skip" rounds with large delays, resulting in regret bounds of order $\sqrt{TK\log(K)} + |R| + \sqrt{D_{\bar{R}}\log(K)}$, where $R$ is an arbitrary set of rounds (which are skipped) and $D_{\bar{R}}$ is the cumulative delay of the feedback for other rounds. Finally, we present another, data-adaptive (AdaGrad-style) version of the algorithm for which the regret adapts to the observed (delayed) losses instead of only adapting to the cumulative delay. This algorithm requires an a priori upper bound on the maximum delay, or advance knowledge of the delay for each decision when it is made. The resulting bound can be orders of magnitude smaller on benign problems, and it can be shown that the delay only affects the regret through the loss of the best arm.
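To make the setting concrete, the following is a minimal sketch of Exp3 with delayed feedback and implicit-exploration loss estimates. It is an illustration under stated assumptions, not the exact algorithm analyzed in the paper: the function name `delayed_exp3`, the step-size rule (which uses the running count of outstanding feedbacks as an observable stand-in for the cumulative delay $D$), and the choice of exploration parameter `gamma = eta / 2` are all assumptions made for this example.

```python
import math
import random


def delayed_exp3(losses, delays, seed=0):
    """Sketch of Exp3 under delayed bandit feedback (hypothetical helper).

    losses[t][i] is the loss of arm i in round t, assumed to lie in [0, 1].
    delays[t] >= 0 is the delay of round t's feedback: it becomes usable
    from round t + delays[t] + 1 onward.
    Returns the total loss incurred by the learner.
    """
    rng = random.Random(seed)
    T, K = len(losses), len(losses[0])
    cum_est = [0.0] * K      # cumulative loss estimates built from received feedback
    pending = {}             # arrival round -> list of (arm, prob, loss, gamma)
    outstanding = 0          # feedbacks still in flight
    cum_outstanding = 0      # running sum of outstanding counts (proxy for D)
    total_loss = 0.0

    for t in range(T):
        # Incorporate feedback that has arrived by the start of round t.
        for arm, prob, loss, gamma_s in pending.pop(t, []):
            # Implicit exploration: divide by (prob + gamma) instead of prob.
            cum_est[arm] += loss / (prob + gamma_s)
            outstanding -= 1

        # Step size uses only quantities observable at decision time
        # (assumed tuning for this sketch).
        cum_outstanding += outstanding
        eta = math.sqrt(math.log(K) / ((t + 1) * K + cum_outstanding))
        gamma = eta / 2.0  # assumed implicit-exploration parameter

        # Exponential weights; subtract the minimum for numerical stability.
        m = min(cum_est)
        weights = [math.exp(-eta * (c - m)) for c in cum_est]
        z = sum(weights)
        probs = [w / z for w in weights]

        arm = rng.choices(range(K), weights=probs)[0]
        loss = losses[t][arm]
        total_loss += loss

        # Schedule this round's feedback for its arrival round.
        arrival = t + 1 + delays[t]
        pending.setdefault(arrival, []).append((arm, probs[arm], loss, gamma))
        outstanding += 1

    return total_loss
```

As a usage example, calling `delayed_exp3(losses, delays)` with `losses = [[0.0, 1.0, 1.0]] * 2000` and `delays = [5] * 2000` simulates a problem in which arm 0 is always best and every feedback arrives five rounds late. Feedback scheduled to arrive after the horizon is simply never used, which matches the convention that only observed feedback can influence decisions.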
