Using simulations between pairs of $\epsilon$-greedy Q-learners with one-period memory, this article demonstrates that the potential function of the stochastic replicator dynamics (Foster and Young, 1990) can be used to predict the emergence of error-proof cooperative strategies from the underlying parameters of the repeated prisoner's dilemma. The observed cooperation rates between Q-learners are related to the ratio of the kinetic energies exerted by the polar attractors of the replicator dynamics under the grim trigger strategy. The frontier separating the region of the parameter space conducive to cooperation from the region dominated by defection can be found by setting the kinetic energy ratio equal to a critical value, which is a function of the discount factor, $f(\delta) = \delta/(1-\delta)$, multiplied by a correction term that accounts for the effect of the algorithms' exploration probability. The gradient at the frontier increases with the distance between the game parameters and the hyperplane that characterizes the incentive compatibility constraint for cooperation under grim trigger. Building on literature from the neurosciences, which suggests that reinforcement learning is useful for understanding human behavior in risky environments, the article further explores the extent to which the frontier derived for Q-learners also explains the emergence of cooperation between humans. Using metadata from laboratory experiments that analyze human choices in the infinitely repeated prisoner's dilemma, the cooperation rates between humans are compared to those observed between Q-learners under similar conditions. The correlation coefficients between the cooperation rates observed for humans and those observed for Q-learners are consistently above $0.8$. The frontier derived from the simulations between Q-learners is also found to predict the emergence of cooperation between humans.
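The simulation setup described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: two $\epsilon$-greedy Q-learners whose state is the previous period's joint action play the repeated prisoner's dilemma, and the fraction of cooperative actions is recorded. The payoff values, learning rate, discount factor, and exploration probability used here are illustrative placeholders, not parameters taken from the article.

```python
import random

# Illustrative prisoner's dilemma payoffs (T > R > P > S); not from the paper.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
ACTIONS = ("C", "D")


class QLearner:
    """Epsilon-greedy Q-learner with one-period memory: the state is the
    previous joint action (own action, opponent's action)."""

    def __init__(self, epsilon=0.1, alpha=0.1, delta=0.95, rng=None):
        self.epsilon, self.alpha, self.delta = epsilon, alpha, delta
        self.rng = rng or random.Random()
        self.q = {}  # maps (state, action) -> estimated value

    def act(self, state):
        # Explore with probability epsilon, otherwise pick the greedy action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update with discount factor delta.
        best_next = max(self.q.get((next_state, a), 0.0) for a in ACTIONS)
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = old + self.alpha * (
            reward + self.delta * best_next - old)


def simulate(periods=50_000, seed=0):
    """Run one pair of Q-learners and return the overall cooperation rate."""
    rng = random.Random(seed)
    p1 = QLearner(rng=random.Random(rng.random()))
    p2 = QLearner(rng=random.Random(rng.random()))
    state = ("C", "C")  # arbitrary initial one-period memory
    coop = 0
    for _ in range(periods):
        # Each player sees the joint action from its own perspective.
        a1, a2 = p1.act(state), p2.act(state[::-1])
        r1, r2 = PAYOFFS[(a1, a2)]
        next_state = (a1, a2)
        p1.update(state, a1, r1, next_state)
        p2.update(state[::-1], a2, r2, next_state[::-1])
        state = next_state
        coop += (a1 == "C") + (a2 == "C")
    return coop / (2 * periods)  # fraction of cooperative actions
```

In the article's framing, cooperation rates obtained from many such runs, swept over the game's payoff parameters, $\delta$, and $\epsilon$, are what the kinetic-energy-ratio frontier is fitted against.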