As an important psychological and social experiment, the Iterated Prisoner's Dilemma (IPD) treats the choice to cooperate or defect as an atomic action. We propose to study the behavior of online learning algorithms in the IPD game, investigating the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits, and reinforcement learning. We evaluate them in an IPD tournament in which multiple agents compete sequentially. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, and to study the capacity of these algorithms to fit human behavior. The results suggest that making decisions based on the current situation performs worst in this kind of social dilemma game. We report multiple findings on online learning behaviors, together with clinical validations, in an effort to connect artificial intelligence algorithms with human behaviors and their abnormal states in neuropsychiatric conditions.
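To make the setup concrete, the following is a minimal sketch of a single IPD match between a context-free bandit learner and a fixed strategy. It is an illustration under stated assumptions, not the tournament code used in this work: the payoff values (5, 3, 1, 0) are the conventional textbook ones, and the names `EpsilonGreedyBandit`, `TitForTat`, and `play_match` are hypothetical.

```python
import random

COOPERATE, DEFECT = 0, 1

# Conventional payoff values (T=5, R=3, P=1, S=0). This is an assumption;
# the text above does not state the payoff matrix.
PAYOFF = {
    (COOPERATE, COOPERATE): (3, 3),
    (COOPERATE, DEFECT): (0, 5),
    (DEFECT, COOPERATE): (5, 0),
    (DEFECT, DEFECT): (1, 1),
}


class EpsilonGreedyBandit:
    """Hypothetical context-free bandit: one running mean reward per action."""

    def __init__(self, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random(0)
        self.counts = [0, 0]
        self.values = [0.0, 0.0]

    def act(self, opponent_last):
        # The opponent's last move is ignored: this agent uses no context.
        if self.rng.random() < self.epsilon:
            return self.rng.choice((COOPERATE, DEFECT))
        # Greedy choice; ties resolve to COOPERATE (first in the tuple).
        return max((COOPERATE, DEFECT), key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental update of the mean reward for the chosen action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


class TitForTat:
    """Fixed baseline: cooperate first, then copy the opponent's last move."""

    def act(self, opponent_last):
        return COOPERATE if opponent_last is None else opponent_last

    def update(self, action, reward):
        pass  # Nothing to learn.


def play_match(agent_a, agent_b, rounds=200):
    """Play one iterated match and return the cumulative payoffs (a, b)."""
    last_a = last_b = None
    total_a = total_b = 0
    for _ in range(rounds):
        move_a = agent_a.act(last_b)
        move_b = agent_b.act(last_a)
        reward_a, reward_b = PAYOFF[(move_a, move_b)]
        agent_a.update(move_a, reward_a)
        agent_b.update(move_b, reward_b)
        total_a += reward_a
        total_b += reward_b
        last_a, last_b = move_a, move_b
    return total_a, total_b


if __name__ == "__main__":
    bandit_score, tft_score = play_match(EpsilonGreedyBandit(), TitForTat())
    print("bandit:", bandit_score, "tit-for-tat:", tft_score)
```

A tournament in this spirit would repeat `play_match` over every pair of agents and aggregate the scores. A contextual bandit or a reinforcement learning agent would differ mainly in that `act` would condition on the observed history instead of ignoring it.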