We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, losses in our setting can fall in a general bounded interval $[-L, L]$, where $L$ is unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose an algorithm named \texttt{SFBanker} for this novel setting, which combines a recent banker online mirror descent technique with carefully designed doubling tricks. We show that \texttt{SFBanker} achieves $\mathcal O(\sqrt{K(D+T)}L)\cdot {\rm polylog}(T, L)$ total regret, where $K$ is the number of arms, $T$ is the total number of steps, and $D$ is the total feedback delay. \texttt{SFBanker} also outperforms existing algorithms on non-delayed (i.e., $D=0$) scale-free adversarial MAB problem instances. We also present a variant of \texttt{SFBanker} for problem instances with non-negative losses (i.e., losses that range in $[0, L]$ for some unknown $L$), achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ total regret, which is near-optimal compared to the $\Omega(\sqrt{KT}+\sqrt{D\log K}L)$ lower bound ([Cesa-Bianchi et al., 2016]).
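To make the role of a doubling trick under an unknown loss scale more concrete, the following is a minimal sketch of the generic idea only. It is not the \texttt{SFBanker} algorithm, whose construction is not described in this abstract. The names `ScaleDoubling`, `observe`, `normalize`, and `run`, as well as the initial guess of 1.0, are hypothetical choices made for illustration.

```python
class ScaleDoubling:
    """Tracks a running guess of the unknown loss scale L (illustrative only)."""

    def __init__(self, initial_guess=1.0):
        self.guess = initial_guess  # current guess for L (assumed starting value)
        self.epochs = 1             # number of epochs started so far

    def observe(self, loss):
        """Return True if the guess was too small, i.e. a new epoch begins."""
        if abs(loss) <= self.guess:
            return False
        # Double the guess until it covers the observed loss magnitude.
        while self.guess < abs(loss):
            self.guess *= 2.0
        self.epochs += 1
        return True

    def normalize(self, loss):
        """Rescale a loss using the current guess, giving a value in [-1, 1]
        whenever the loss magnitude does not exceed the guess."""
        return loss / self.guess


def run(losses):
    """Feed a loss sequence through the tracker; return (final guess, epochs)."""
    tracker = ScaleDoubling()
    for loss in losses:
        tracker.observe(loss)
    return tracker.guess, tracker.epochs
```

In a sketch like this, a base learner designed for bounded losses would typically be restarted whenever `observe` returns `True`, and the number of restarts grows only logarithmically in the ratio between the true scale and the initial guess. How such restarts interact with delayed feedback is exactly the part this sketch leaves out.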