Consider a scenario where a player chooses an action in each round $t$ of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria (CCE) if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to excessively large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$, where $D=\sum_{t=1}^{T}d_{t}$ is the total delay, even when $D$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound that holds when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
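To make the delayed-feedback setting concrete, the following is a minimal sketch of EXP3 in which the cost of the arm played in round $t$ becomes available only after $d_{t}$ rounds. It is an illustration under stated assumptions, not the authors' implementation: the function name `delayed_exp3`, the importance-weighted estimator that uses the probability at play time, and the fixed step size are all choices made here. In particular, the sketch takes the step size as an input, so it does not reproduce the doubling trick for unknown $D$ and $T$.

```python
import math
import random
from collections import defaultdict


def delayed_exp3(costs, delays, eta, seed=0):
    """Sketch of EXP3 with delayed bandit feedback (hypothetical helper).

    costs[t][k] in [0, 1] is the cost of arm k in round t.
    delays[t] >= 0 is the number of rounds after which the cost of the
    arm played in round t is observed.
    Returns the list of played arms and the final mixed strategy.
    """
    rng = random.Random(seed)
    T, K = len(costs), len(costs[0])
    cum_est = [0.0] * K            # cumulative importance-weighted cost estimates
    arrivals = defaultdict(list)   # arrival round -> [(arm, cost, prob at play time)]
    chosen = []

    def strategy():
        # Exponential weights; subtracting the minimum keeps exp() stable.
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        total = sum(w)
        return [x / total for x in w]

    for t in range(T):
        p = strategy()
        arm = rng.choices(range(K), weights=p)[0]
        chosen.append(arm)
        # The cost of this pull is revealed only in round t + delays[t].
        arrivals[t + delays[t]].append((arm, costs[t][arm], p[arm]))
        # Process every observation that arrives at the end of round t.
        for a, c, pa in arrivals.pop(t, []):
            cum_est[a] += c / pa   # assumed estimator: cost / play-time probability
    # Feedback scheduled to arrive after round T - 1 is never used.
    return chosen, strategy()
```

When $T$ and $D$ are known, a step size of the form `eta = math.sqrt(math.log(K) / (K * T + D))` matches the shape of the stated $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ bound; this tuning is an assumption of the sketch rather than a value taken from the text.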