We investigate the repeated prisoner's dilemma in which both players alternately use reinforcement learning to obtain their optimal memory-one strategies. We analytically solve the joint Bellman optimality equations of reinforcement learning. We find that, among the sixteen deterministic memory-one strategies, the Win-stay Lose-shift strategy, the Grim strategy, and the strategy that always defects can form symmetric equilibria of the mutual reinforcement learning process.
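To make the strategy space concrete, the following minimal sketch enumerates the sixteen deterministic memory-one strategies and labels the three named above. The encoding is an assumption made for illustration and is not taken from the paper: a strategy is a table mapping the previous round's outcome (own move, opponent's move) to the next move, with outcomes ordered CC, CD, DC, DD, and the names `STATES`, `NAMED`, `all_memory_one_strategies`, and `label` are hypothetical. The tables for the named strategies follow their conventional definitions. The sketch only lists the strategies and does not solve the Bellman equations.

```python
from itertools import product

# Previous-round outcomes from the focal player's view: (own move, opponent's move).
# This ordering is an illustrative convention, not necessarily the paper's.
STATES = ("CC", "CD", "DC", "DD")


def all_memory_one_strategies():
    """Return every deterministic memory-one strategy as a state -> action dict."""
    return [dict(zip(STATES, actions)) for actions in product("CD", repeat=len(STATES))]


# Conventional definitions of the three strategies named in the abstract.
NAMED = {
    # Repeat the last move after a good payoff (CC or DC), switch otherwise.
    "Win-stay Lose-shift": {"CC": "C", "CD": "D", "DC": "D", "DD": "C"},
    # Cooperate only after mutual cooperation.
    "Grim": {"CC": "C", "CD": "D", "DC": "D", "DD": "D"},
    # Defect regardless of the previous outcome.
    "Always defect": {"CC": "D", "CD": "D", "DC": "D", "DD": "D"},
}


def label(strategy):
    """Return the conventional name of a strategy table, or None if it is unnamed here."""
    for name, table in NAMED.items():
        if table == strategy:
            return name
    return None


strategies = all_memory_one_strategies()
for s in strategies:
    code = "".join(s[state] for state in STATES)
    print(code, label(s) or "")
```

Each strategy is fixed by one binary choice for each of the four previous outcomes, which gives 2^4 = 16 tables. Under the assumed ordering, Win-stay Lose-shift is `CDDC`, Grim is `CDDD`, and always-defect is `DDDD`.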