When two players are engaged in a repeated game with unknown payoff matrices, they may use single-agent multi-armed bandit algorithms to choose their actions independently of each other. We show that when the players use Thompson sampling, the game dynamics converge to the Nash equilibrium under a mild assumption on the payoff matrices. Therefore, algorithmic collusion does not arise in this case, even though the players do not intentionally deploy competitive strategies. In proving the convergence result, we find that the standard stochastic approximation framework does not apply, because the inferior actions are updated only sporadically and infrequently and the dynamics lack Lipschitz continuity. We instead develop a novel sample-path-wise approach to establish convergence. However, when the payoff matrices do not satisfy this assumption, the game may converge to collusive outcomes.
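The following is a minimal sketch of the setup described above: two players repeatedly play a game with unknown payoff matrices, each running Beta-Bernoulli Thompson sampling over its own actions while observing only its own rewards. The specific matrices `A` and `B` and the Bernoulli reward model are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x2 payoff matrices (entries are success probabilities of
# Bernoulli rewards); A belongs to the row player, B to the column player.
# With these values, (row action 0, column action 0) is a pure Nash equilibrium.
A = np.array([[0.9, 0.2],
              [0.3, 0.6]])
B = np.array([[0.8, 0.4],
              [0.1, 0.7]])

class ThompsonPlayer:
    """Independent Beta-Bernoulli Thompson sampling over n_actions arms."""
    def __init__(self, n_actions):
        self.alpha = np.ones(n_actions)  # Beta posterior: 1 + successes
        self.beta = np.ones(n_actions)   # Beta posterior: 1 + failures

    def act(self):
        # Sample a mean-reward estimate per action, play the argmax.
        return int(np.argmax(rng.beta(self.alpha, self.beta)))

    def update(self, action, reward):
        self.alpha[action] += reward
        self.beta[action] += 1 - reward

row, col = ThompsonPlayer(2), ThompsonPlayer(2)
for t in range(50_000):
    i, j = row.act(), col.act()
    # Each player observes only its own Bernoulli reward, not the
    # opponent's action or payoff matrix.
    row.update(i, rng.binomial(1, A[i, j]))
    col.update(j, rng.binomial(1, B[i, j]))

# Play counts per action; under the paper's assumption the empirical
# frequencies would concentrate on a Nash equilibrium of (A, B).
print("row counts:", row.alpha + row.beta - 2)
print("col counts:", col.alpha + col.beta - 2)
```

Note that an inferior action's posterior is updated only on the rare rounds it is sampled, which is precisely the sporadic-update feature that breaks the standard stochastic approximation argument.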