Learning in general-sum games is unstable and frequently leads to socially undesirable (Pareto-dominated) outcomes. To mitigate this, Learning with Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting, by accounting for each agent's influence on their opponents' anticipated learning steps. However, the original LOLA formulation (and follow-up work) is inconsistent because LOLA models other agents as naive learners rather than LOLA agents. In previous work, this inconsistency was suggested as a cause of LOLA's failure to preserve stable fixed points (SFPs). First, we formalize consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency problem if it converges. Second, we correct a claim made in the literature by Sch\"afer and Anandkumar (2019), proving that Competitive Gradient Descent (CGD) does not recover HOLA as a series expansion (and fails to solve the consistency problem). Third, we propose a new method called Consistent LOLA (COLA), which learns update functions that are consistent under mutual opponent shaping. It requires no more than second-order derivatives and learns consistent update functions even when HOLA fails to converge. However, we also prove that even consistent update functions do not preserve SFPs, contradicting the hypothesis that this shortcoming is caused by LOLA's inconsistency. Finally, in an empirical evaluation on a set of general-sum games, we find that COLA finds prosocial solutions and that it converges under a wider range of learning rates than HOLA and LOLA. We support the latter finding with a theoretical result for a simple game.