Consider a player who, in each round $t$ of $T$ rounds, chooses an action and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that even if the players' algorithms lose their "no regret" property because the delays are too large, the expected discounted ergodic distribution of play converges to the set of coarse correlated equilibria (CCE) if the algorithms have "no discounted-regret". For a zero-sum game, we show that no discounted-regret is sufficient for the discounted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves a regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves a regret of $O\left(\sqrt{\ln K\left(KT+D\right)}\right)$, even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that provably retains the regret bound that holds when $D$ and $T$ are known. Using these bounds, we show that EXP3 and FKM have no discounted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, the CCE of a finite or convex unknown game can be approximated via simulation even when only delayed bandit feedback is available.
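To make the delayed bandit setting concrete, the following is a minimal sketch of EXP3 in which the cost of round $t$ only becomes available at the end of round $t+d_{t}$. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the function name `delayed_exp3`, its arguments, and the fixed step size `eta` are hypothetical, costs are assumed to lie in $[0,1]$, and the doubling trick for unknown $D$ and $T$ is not shown. A step size of the form $\sqrt{\ln K/(KT+D)}$ is used in the usage note only because it mirrors the shape of the stated regret bound.

```python
import math
import random


def delayed_exp3(costs, delays, eta, seed=0):
    """Sketch of EXP3 with delayed bandit feedback (hypothetical helper).

    costs[t][a] in [0, 1] is the cost of arm a in round t.
    The cost of round t is revealed at the end of round t + delays[t].
    Returns the list of arms played and the total cost incurred.
    """
    rng = random.Random(seed)
    T, K = len(costs), len(costs[0])
    cum_est = [0.0] * K   # cumulative importance-weighted cost estimates
    pending = {}          # arrival round -> list of (arm, probability, cost)
    arms, total = [], 0.0
    for t in range(T):
        # Exponential weights over the feedback received so far.
        # Subtracting the minimum keeps the exponentials numerically stable.
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        s = sum(w)
        p = [x / s for x in w]
        a = rng.choices(range(K), weights=p)[0]
        c = costs[t][a]
        arms.append(a)
        total += c
        # Store the feedback until its arrival round.
        pending.setdefault(t + delays[t], []).append((a, p[a], c))
        # Apply every feedback item that arrives at the end of this round.
        for arm, prob, cost in pending.pop(t, []):
            cum_est[arm] += cost / prob
    return arms, total
```

For example, with `K = 3` arms, `T = 2000` rounds, a constant delay `d_t = 5` (so `D = 5 * T`), and `eta = math.sqrt(math.log(K) / (K * T + D))`, the sampler can be run on a cost sequence in which one arm is consistently cheaper. Feedback whose arrival round falls after the last round is simply never used in this sketch.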