In this work, we ask and answer what makes classical temporal-difference reinforcement learning with epsilon-greedy strategies cooperative. Cooperation in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory has revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate remain contested. Here, we demonstrate which individual elements of the multi-agent learning setting lead to cooperation, and how. We use the iterated Prisoner's Dilemma with one-period memory as a testbed. Each of the two learning agents learns a strategy that conditions its next action choice on both agents' actions in the previous round. We find that, in addition to a high regard for future rewards, a low exploration rate, and a small learning rate, it is primarily the intrinsic stochastic fluctuations of the reinforcement learning process that double the final rate of cooperation to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process; it is a critical asset for the learning of cooperation. However, we also point out the trade-off between a high likelihood of cooperative behavior and achieving it in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and regulating undesired collusive effects.
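The learning setting described above can be sketched in code. The following is a minimal, illustrative simulation of two epsilon-greedy temporal-difference (Q-learning) agents in the iterated Prisoner's Dilemma with one-period memory; the payoff values and learning parameters are illustrative assumptions, not the paper's exact configuration.

```python
import random

# Standard Prisoner's Dilemma payoffs (T=5, R=3, P=1, S=0); illustrative.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
ACTIONS = ["C", "D"]

def choose(Q, state, eps, rng):
    """Epsilon-greedy action selection from a tabular Q-function."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run(rounds=20000, alpha=0.05, gamma=0.95, eps=0.01, seed=0):
    """Simulate two Q-learners; return the average cooperation rate."""
    rng = random.Random(seed)
    # One-period memory: a state is the joint action of the last round,
    # ordered (own action, opponent's action) from each agent's view.
    Q1 = {(s, a): 0.0 for s in PAYOFF for a in ACTIONS}
    Q2 = {(s, a): 0.0 for s in PAYOFF for a in ACTIONS}
    state = ("C", "C")  # arbitrary initial memory
    coop = 0
    for _ in range(rounds):
        a1 = choose(Q1, state, eps, rng)
        a2 = choose(Q2, (state[1], state[0]), eps, rng)  # mirrored view
        r1, r2 = PAYOFF[(a1, a2)]
        next_state = (a1, a2)
        # Temporal-difference (Q-learning) update for agent 1
        Q1[(state, a1)] += alpha * (
            r1 + gamma * max(Q1[(next_state, a)] for a in ACTIONS)
            - Q1[(state, a1)])
        # Same update for agent 2, in its mirrored state representation
        s2, n2 = (state[1], state[0]), (a2, a1)
        Q2[(s2, a2)] += alpha * (
            r2 + gamma * max(Q2[(n2, a)] for a in ACTIONS)
            - Q2[(s2, a2)])
        state = next_state
        coop += (a1 == "C") + (a2 == "C")
    return coop / (2 * rounds)
```

The epsilon-greedy exploration and the sampled (rather than expected) updates are the source of the intrinsic stochastic fluctuations the abstract refers to; sweeping `alpha`, `gamma`, and `eps` in such a simulation is one way to probe how each element affects the resulting cooperation rate.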