We investigate the repeated prisoner's dilemma game in which both players alternately use reinforcement learning to obtain their optimal memory-one strategies. We theoretically solve the simultaneous Bellman optimality equations of reinforcement learning. We find that, among all deterministic memory-one strategies, the Win-stay Lose-shift strategy, the Grim strategy, and the strategy that always defects can form symmetric equilibria of the mutual reinforcement learning process.
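
The symmetric-equilibrium condition can be illustrated numerically with a minimal sketch; it is not the analytical solution referred to above. It assumes a noiseless game, hypothetical payoffs T = 5, R = 4, P = 1, S = 0 (chosen so that 2R > T + P), a discount factor of 0.9, and plain value iteration as the solver. A state is the pair (own previous action, opponent's previous action). The names `best_response` and `is_symmetric_equilibrium` are illustrative only. A strategy counts as a symmetric equilibrium here when the best response to it, read off from the Bellman optimality equation, is the strategy itself.

```python
from itertools import product

C, D = 0, 1
# A state is (own previous action, opponent's previous action).
STATES = list(product((C, D), repeat=2))

# Assumed payoffs (T > R > P > S, 2R > T + P) and discount factor;
# these values are illustrative and not taken from the paper.
T, R, P, S = 5.0, 4.0, 1.0, 0.0
PAYOFF = {(C, C): R, (C, D): S, (D, C): T, (D, D): P}
GAMMA = 0.9


def q_value(v, opponent, state, action, gamma):
    """One-step lookahead against a fixed memory-one opponent."""
    own_prev, other_prev = state
    # The opponent sees the same history from its own perspective.
    reply = opponent[(other_prev, own_prev)]
    return PAYOFF[(action, reply)] + gamma * v[(action, reply)]


def best_response(opponent, gamma=GAMMA, sweeps=500):
    """Solve the Bellman optimality equation by value iteration."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(q_value(v, opponent, s, a, gamma) for a in (C, D))
             for s in STATES}
    # Greedy policy with respect to the converged values.
    return {s: max((C, D), key=lambda a: q_value(v, opponent, s, a, gamma))
            for s in STATES}


def is_symmetric_equilibrium(strategy, gamma=GAMMA):
    """True if the strategy is its own best response in every state."""
    return best_response(strategy, gamma) == strategy


ALLD = {s: D for s in STATES}
GRIM = {s: C if s == (C, C) else D for s in STATES}
WSLS = {s: C if s[0] == s[1] else D for s in STATES}
ALLC = {s: C for s in STATES}

if __name__ == "__main__":
    for name, strat in [("WSLS", WSLS), ("Grim", GRIM),
                        ("ALLD", ALLD), ("ALLC", ALLC)]:
        print(name, is_symmetric_equilibrium(strat))
```

Under these assumed parameters the three strategies named in the abstract pass the check, while always cooperating does not, because the best response to it is to always defect. Whether Win-stay Lose-shift and Grim pass depends on the payoffs and the discount factor: with the common choice R = 3, for instance, a one-shot deviation against Win-stay Lose-shift is profitable in this noiseless sketch.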