We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each other's actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the algorithms we propose. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best-response to the learner's mixed strategy. Along the way, we show that Thompson sampling fails catastrophically in this setting and provide an empirical comparison with existing algorithms.
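To make the feedback model concrete, below is a minimal sketch of the interaction protocol described above: on each round both players choose actions, observe each other's actions, and receive only a noisy payoff, while the payoff matrix itself stays hidden. The specific matrix, Gaussian noise level, and uniform-random action choices are illustrative assumptions, not the paper's algorithms or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 3, 4
A = rng.uniform(-1.0, 1.0, size=(n_rows, n_cols))  # payoff matrix, unknown to both players

def play_round(row_action: int, col_action: int) -> float:
    """Both players observe each other's actions and this noisy payoff;
    the matrix A itself is never revealed (bandit feedback)."""
    return A[row_action, col_action] + rng.normal(scale=0.1)  # assumed Gaussian noise

for t in range(5):
    i = rng.integers(n_rows)   # learner's action (placeholder for a UCB / K-learning variant)
    j = rng.integers(n_cols)   # opponent's action (could be an adversarial best-response)
    payoff = play_round(i, j)
    # a learner would update its statistics for entry (i, j) here
    print(t, i, j, round(payoff, 3))
```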