In general-sum games, the interaction of self-interested learning agents commonly leads to collectively worst-case outcomes, such as defect-defect in the iterated prisoner's dilemma (IPD). To overcome this, some methods, such as Learning with Opponent-Learning Awareness (LOLA), shape their opponents' learning process. However, these methods are myopic since only a small number of steps can be anticipated, are asymmetric since they treat other agents as naive learners, and require the use of higher-order derivatives, which are calculated through white-box access to an opponent's differentiable learning algorithm. To address these issues, we propose Model-Free Opponent Shaping (M-FOS). M-FOS learns in a meta-game in which each meta-step is an episode of the underlying ("inner") game. The meta-state consists of the inner policies, and the meta-policy produces a new inner policy to be used in the next episode. M-FOS then uses generic model-free optimisation methods to learn meta-policies that accomplish long-horizon opponent shaping. Empirically, M-FOS near-optimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the well-known Zero-Determinant (ZD) extortion strategy in the IPD. In the same settings, M-FOS leads to socially optimal outcomes under meta-self-play. Finally, we show that M-FOS can be scaled to high-dimensional settings.
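To make the meta-game construction concrete, the following is a minimal sketch of the idea on the IPD, assuming memory-one inner policies (5 logits each), a naive-learning opponent updated by finite-difference gradient ascent, and simple evolution strategies standing in for the "generic model-free optimisation" at the meta level. All names, the linear meta-policy parameterisation, and the hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (not the paper's implementation): memory-one IPD inner
# policies, a naive-learning opponent, and evolution strategies (ES) as the
# generic model-free meta-optimizer.  All names and settings are assumptions.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.96                        # inner-game discount factor
R1 = np.array([-1., -3., 0., -2.])  # agent-1 payoffs in states CC, CD, DC, DD
R2 = np.array([-1., 0., -3., -2.])  # agent-2 payoffs (same game, roles swapped)

def ipd_return(theta1, theta2):
    """Exact discounted per-step IPD returns for two memory-one policies.
    Each theta holds 5 logits: [initial, after CC, after CD, after DC, after DD]."""
    p = 1 / (1 + np.exp(-np.asarray(theta1, float)))
    q = 1 / (1 + np.exp(-np.asarray(theta2, float)))
    p0, pS = p[0], p[1:]
    q0, qS = q[0], q[1:][[0, 2, 1, 3]]  # re-index opponent states to agent 1's view
    s0 = np.array([p0 * q0, p0 * (1 - q0), (1 - p0) * q0, (1 - p0) * (1 - q0)])
    M = np.stack([pS * qS, pS * (1 - qS), (1 - pS) * qS, (1 - pS) * (1 - qS)], axis=1)
    v = s0 @ np.linalg.inv(np.eye(4) - GAMMA * M)   # discounted state visitation
    return (1 - GAMMA) * v @ R1, (1 - GAMMA) * v @ R2

def naive_grad(theta1, theta2, eps=1e-4):
    """Finite-difference gradient of the opponent's return w.r.t. its own policy."""
    g = np.zeros(5)
    for i in range(5):
        d = np.zeros(5); d[i] = eps
        g[i] = (ipd_return(theta1, theta2 + d)[1]
                - ipd_return(theta1, theta2 - d)[1]) / (2 * eps)
    return g

def meta_policy(W, theta1, theta2):
    """Meta-policy: maps the meta-state (both inner policies) to the shaping
    agent's inner policy for the next inner episode (here just a linear map)."""
    return W @ np.concatenate([theta1, theta2, [1.0]])

def meta_episode(W, T=50, opp_lr=1.0):
    """One meta-episode: T inner episodes against a naive learner."""
    theta1, theta2 = np.zeros(5), np.zeros(5)
    total = 0.0
    for _ in range(T):
        theta1 = meta_policy(W, theta1, theta2)                 # meta-step
        r1, _ = ipd_return(theta1, theta2)                      # inner episode
        theta2 = theta2 + opp_lr * naive_grad(theta1, theta2)   # naive opponent update
        total += r1
    return total / T

# Model-free meta-optimization of the meta-policy parameters with simple ES.
W, sigma, alpha, pop = np.zeros((5, 11)), 0.1, 0.02, 16
for it in range(100):
    noise = rng.normal(size=(pop,) + W.shape)
    scores = np.array([meta_episode(W + sigma * n) for n in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    W = W + alpha / (pop * sigma) * np.einsum("i,ijk->jk", scores, noise)
    if it % 20 == 0:
        print(f"meta-iter {it:3d}  avg per-step return {meta_episode(W):+.3f}")
```

Because the meta-objective is evaluated only through inner-episode returns, the meta-optimizer needs no derivatives of, or white-box access to, the opponent's learning rule; here the naive learner could be swapped for any other learning algorithm without changing the meta-training loop.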