We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, the loss of an action is split into $d$ components, which spread over consecutive rounds after the action is chosen, and in each round the algorithm observes only the aggregation of the loss components originating from the latest $d$ rounds. Previous works focus on the oblivious adversarial setting, whereas we investigate the harder non-oblivious setting. We show that the non-oblivious setting incurs $\Omega(T)$ pseudo regret even when the loss sequence has bounded memory. Nevertheless, we propose a wrapper algorithm that enjoys $o(T)$ policy regret on many adversarial bandit problems under the assumption that the loss sequence has bounded memory. In particular, for the $K$-armed bandit and bandit convex optimization, we obtain an $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for the $K$-armed bandit. Our lower bound holds even when the loss sequence is oblivious and only the delay is non-oblivious. This answers the open problem posed in \cite{wang2021adaptive}, showing that non-oblivious delay alone suffices to incur $\tilde{\Omega}(T^{2/3})$ regret.
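For concreteness, here is a minimal sketch of the observation model described above; the notation $\ell_t^{(i)}$ for the $i$-th delayed component of the loss of the action $a_t$ chosen at round $t$ is introduced only for illustration and need not match the paper's own notation:
\[
\ell_t(a_t) \;=\; \sum_{i=0}^{d-1} \ell_t^{(i)}(a_t),
\qquad
o_t \;=\; \sum_{s=\max\{1,\, t-d+1\}}^{t} \ell_s^{(t-s)}(a_s),
\]
where the component $\ell_s^{(i)}(a_s)$ is incurred at round $s+i$, and the aggregate $o_t$ of all components falling on round $t$ is the only feedback revealed to the learner at that round.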