We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting losses can fall in a general bounded interval $[-L, L]$ that is unknown to the agent beforehand. Furthermore, the feedback of each arm pull can be arbitrarily delayed. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this setting, which combines a recent "convex combination trick" with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one, SFD-TINF, uses the $\frac 12$-Tsallis entropy regularizer and achieves $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower bound when $K$ is regarded as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KD}L)$, which reduces to the worst-case $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret and is thus more general than SFD-TINF, albeit with a more involved analysis and several extra logarithmic factors. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.
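For background, a minimal sketch of the follow-the-regularized-leader (FTRL) update underlying Tsallis-INF-style algorithms with the $\frac 12$-Tsallis entropy regularizer mentioned above; the delay-adapted learning scales $\eta_t$, the doubling and skipping technique, and the convex combination step are the paper's contributions and are not reflected here:
$$x_t = \operatorname*{arg\,min}_{x \in \Delta_K} \left\{ \Big\langle \sum_{s < t} \hat\ell_s,\, x \Big\rangle - \frac{1}{\eta_t} \sum_{i=1}^{K} \sqrt{x_i} \right\}, \qquad \hat\ell_{s,i} = \frac{\ell_{s,i}\,\mathbb{1}\{i_s = i\}}{x_{s,i}},$$
where $\Delta_K$ is the probability simplex over the $K$ arms and $\hat\ell_s$ is the standard importance-weighted loss estimator; in the delayed setting, the inner sum runs only over the rounds whose feedback has already arrived by round $t$.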