We propose `Banker-OMD`, a novel framework that generalizes the classical Online Mirror Descent (OMD) technique from the online learning literature. `Banker-OMD` almost completely decouples feedback-delay handling from task-specific OMD algorithm design, allowing new algorithms that handle feedback delays robustly to be designed with ease. Specifically, it offers a general methodology for achieving $\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of `Banker-OMD` with applications to two important bandit learning scenarios with delayed feedback: delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. `Banker-OMD` yields the first delayed scale-free adversarial MAB algorithm, achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret, and the first delayed adversarial linear bandit algorithm, achieving $\tilde{\mathcal O}(\mathrm{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies an $\tilde{\mathcal O}(\sqrt{KT}L)$ regret bound for non-delayed scale-free adversarial MAB, the first to match the $\Omega(\sqrt{KT}L)$ lower bound up to logarithmic factors, which may be of independent interest.
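For context, the classical (non-delayed) OMD update that `Banker-OMD` generalizes can be sketched as follows; here $\mathcal{X}$ is the decision set, $\Phi$ a mirror map with Bregman divergence $D_\Phi$, $\eta_t$ a learning rate, and $\hat{\ell}_t$ a loss estimate. This notation is illustrative only; the regularizers and rates actually used by `Banker-OMD` are specified in the paper body.

```latex
x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}}
  \; \eta_t \langle \hat{\ell}_t, x \rangle + D_{\Phi}(x, x_t),
\qquad
D_{\Phi}(x, y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y),\, x - y \rangle .
```

Under delayed feedback, $\hat{\ell}_t$ may only become available several rounds after $x_t$ is played, which is the difficulty the framework's delay-handling layer is designed to absorb.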