We consider a scenario in which two reinforcement learning agents repeatedly play a matrix game against each other and update their parameters after each round. The agents' decision-making is transparent to each other, which allows each agent to predict how their opponent will play against them. To prevent an infinite regress of mutual prediction, each agent is required to give an opponent-independent response with probability at least epsilon. Transparency also allows each agent to anticipate and shape the other agent's gradient step, i.e., to move to regions of parameter space in which the opponent's gradient points in a direction favourable to them. We study the resulting dynamics experimentally, using two algorithms from previous literature (LOLA and SOS) for opponent-aware learning. We find that the combination of mutually transparent decision-making and opponent-aware learning robustly leads to mutual cooperation in a single-shot prisoner's dilemma. In a game of chicken, in which both agents try to manoeuvre their opponent towards their preferred equilibrium, converging to a mutually beneficial outcome turns out to be much harder, and opponent-aware learning can even lead to worst-case outcomes for both agents. This highlights the need to develop opponent-aware learning algorithms that achieve acceptable outcomes in social dilemmas involving an equilibrium selection problem.
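To make the gradient-shaping idea concrete, the following is a minimal sketch (not the authors' code) of a first-order LOLA-style update on a one-shot prisoner's dilemma, assuming standard payoffs (R = -1, S = -3, T = 0, P = -2) and sigmoid-parameterised cooperation probabilities. It illustrates only the opponent-aware correction term, in which each agent differentiates through the opponent's anticipated gradient step; it omits the transparency and epsilon-bounded mutual-prediction mechanism described above, as well as SOS.

```python
# Hypothetical illustration of a first-order LOLA-style update on a
# one-shot prisoner's dilemma; payoff values and learning rates are
# assumptions, not taken from the paper.
import jax
import jax.numpy as jnp

R, S, T, P = -1.0, -3.0, 0.0, -2.0  # assumed prisoner's-dilemma payoffs

def value(theta_a, theta_b):
    """Expected payoff to agent a, given both agents' cooperation logits."""
    pa, pb = jax.nn.sigmoid(theta_a), jax.nn.sigmoid(theta_b)
    return (pa * pb * R + pa * (1 - pb) * S
            + (1 - pa) * pb * T + (1 - pa) * (1 - pb) * P)

def lola_step(theta1, theta2, lr=1.0, eta=1.0):
    """One first-order LOLA update for agent 1 (agent 2 is symmetric)."""
    # Naive gradient of agent 1's value w.r.t. its own parameters.
    g1 = jax.grad(value, argnums=0)(theta1, theta2)
    # How agent 1's value changes through the opponent's parameters.
    dV1_dtheta2 = jax.grad(value, argnums=1)(theta1, theta2)
    # Opponent's anticipated gradient on its own value, differentiated
    # w.r.t. agent 1's parameters: d^2 V2 / (d theta1 d theta2).
    cross = jax.grad(
        lambda t1, t2: jax.grad(value, argnums=0)(t2, t1),
        argnums=0,
    )(theta1, theta2)
    # Gradient ascent on the value, plus the LOLA shaping term.
    return theta1 + lr * (g1 + eta * dV1_dtheta2 * cross)

theta1 = theta2 = jnp.array(0.0)  # start at 50% cooperation probability
for _ in range(200):
    theta1, theta2 = lola_step(theta1, theta2), lola_step(theta2, theta1)
print(jax.nn.sigmoid(theta1), jax.nn.sigmoid(theta2))  # cooperation probabilities
```

In this plain one-shot setting, without the transparency mechanism, such dynamics need not reach cooperation; the abstract's cooperation result relies on combining opponent-aware learning with mutually transparent decision-making.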