A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner's dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated "always defect" policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
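To make the Pavlov (win-stay, lose-shift) policy named above concrete, here is a minimal sketch, not the paper's implementation: two Pavlov agents in the iterated prisoner's dilemma with the standard payoffs T=5, R=3, P=1, S=0. The names `pavlov`, `play`, and the perturbation mechanism are illustrative assumptions, not taken from the paper.

```python
C, D = 0, 1  # cooperate, defect

# Standard IPD payoffs: PAYOFF[(my_action, opp_action)] -> my payoff.
PAYOFF = {
    (C, C): 3,  # reward R
    (C, D): 0,  # sucker S
    (D, C): 5,  # temptation T
    (D, D): 1,  # punishment P
}

def pavlov(my_last, opp_last):
    """Win-stay, lose-shift: repeat the last action after a 'win'
    (payoff R or T, i.e. the opponent cooperated), switch after a
    'loss'. Equivalently: cooperate iff both actions matched."""
    return C if my_last == opp_last else D

def play(rounds, perturb_at=None):
    """Self-play between two Pavlov agents, optionally forcing
    agent 0 to defect once (e.g. an exploration step)."""
    history = []
    a0 = a1 = C  # both open with cooperation
    for t in range(rounds):
        if t > 0:
            prev0, prev1 = history[-1]
            a0 = pavlov(prev0, prev1)
            a1 = pavlov(prev1, prev0)
        if t == perturb_at:
            a0 = D  # one-shot deviation
        history.append((a0, a1))
    return history

h = play(8, perturb_at=2)
# After the deviation (D,C), one round of mutual punishment (D,D)
# follows, then both agents return to mutual cooperation (C,C).
```

This illustrates why Pavlov, unlike "always defect", is a plausible attractor for learners: deviations trigger a single punishing round and cooperation is re-established, so the long-run average payoff stays near the mutual-cooperation reward R.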