We study nonstochastic bandits and experts in a delayed setting where delays depend on both time and arms. While the setting in which delays only depend on time has been extensively studied, the arm-dependent delay setting better captures real-world applications at the cost of introducing new technical challenges. In the full information (experts) setting, we design an algorithm with a first-order regret bound that reveals an interesting trade-off between delays and losses. We prove a similar first-order regret bound for the bandit setting when the learner is allowed to observe how many losses are missing. These are the first bounds in the delayed setting that depend only on the losses and delays of the best arm. When no information other than the losses is observed in the bandit setting, we still manage to prove a regret bound through a modification of the algorithm of Zimmert and Seldin (2020). Our analyses hinge on a novel bound on the drift, measuring how much better an algorithm can perform when given a look-ahead of one round.