Artificial intelligence and robotics competitions often adopt a class of game paradigms in which each player privately commits a strategy to a game system, which simulates the game under the collected joint strategy and then returns payoffs to the players. This paper studies strategy commitment in two-player symmetric games, where the players' strategy spaces are identical and their payoffs are symmetric. First, we introduce two digraph-based, meta-level metrics for strategy evaluation in two-agent reinforcement learning, grounded on the sink equilibrium. The metrics rank the strategies of a single player and determine the set of strategies preferred for private commitment. Then, to find the preferred strategies under these metrics, we propose two variants of the classical self-play learning algorithm, called strictly best-response and weakly better-response self-play. By modeling the learning processes as walks over joint-strategy response digraphs, we prove that the strategies learnt by the two variants are preferred under the two metrics, respectively. Moreover, the strategies preferred under both metrics are identified, and a connection is established between the adjacency matrices induced by one metric and one variant. Finally, simulations are provided to illustrate the results.
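To make the objects in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation: it builds the weakly better-response digraph of a small symmetric game, extracts its sink equilibria as the sink strongly connected components, and runs a strictly best-response self-play walk over the joint-strategy digraph. The payoff matrix and all function names are hypothetical choices for illustration only.

```python
# Hypothetical 2-strategy symmetric coordination game, for illustration only:
# A[i][j] is player 1's payoff for strategy i against strategy j;
# by symmetry, player 2's payoff at joint strategy (i, j) is A[j][i].
A = [[2, 0],
     [0, 1]]
n = len(A)
nodes = [(i, j) for i in range(n) for j in range(n)]

def better_response_edges():
    """Weakly better-response digraph: one arc per unilateral deviation
    that strictly improves the deviating player's payoff."""
    edges = {v: [] for v in nodes}
    for (i, j) in nodes:
        for k in range(n):
            if A[k][j] > A[i][j]:          # player 1 deviates i -> k
                edges[(i, j)].append((k, j))
            if A[k][i] > A[j][i]:          # player 2 deviates j -> k
                edges[(i, j)].append((i, k))
    return edges

def sink_equilibria(edges):
    """Sink equilibria = sink strongly connected components (Kosaraju)."""
    order, seen = [], set()
    def dfs(v):                            # first pass: record finish order
        seen.add(v)
        for w in edges[v]:
            if w not in seen:
                dfs(w)
        order.append(v)
    for v in nodes:
        if v not in seen:
            dfs(v)
    rev = {v: [] for v in nodes}           # second pass on the reversed graph
    for v in nodes:
        for w in edges[v]:
            rev[w].append(v)
    comp, c = {}, 0
    for v in reversed(order):
        if v in comp:
            continue
        stack, comp[v] = [v], c
        while stack:
            u = stack.pop()
            for w in rev[u]:
                if w not in comp:
                    comp[w] = c
                    stack.append(w)
        c += 1
    leaky = {comp[v] for v in nodes for w in edges[v] if comp[w] != comp[v]}
    return [[v for v in nodes if comp[v] == s]
            for s in range(c) if s not in leaky]

def best_response_self_play(start, steps=20):
    """Strictly best-response self-play: the players alternate updates, each
    moving only to a strictly improving best response; the walk is absorbed
    by a sink component of the strictly-best-response digraph."""
    i, j = start
    for t in range(steps):
        if t % 2 == 0:                     # player 1's turn to update
            k = max(range(n), key=lambda s: A[s][j])
            if A[k][j] > A[i][j]:
                i = k
        else:                              # player 2's turn to update
            k = max(range(n), key=lambda s: A[s][i])
            if A[k][i] > A[j][i]:
                j = k
    return (i, j)

edges = better_response_edges()
print(sink_equilibria(edges))   # the singleton sinks {(1, 1)} and {(0, 0)}
print(best_response_self_play((0, 1)))     # walk from (0, 1) settles at (1, 1)
```

In this coordination game the two pure Nash equilibria are exactly the singleton sink components, and the self-play walk from any joint strategy is absorbed by one of them; in games without pure equilibria (e.g., rock-paper-scissors payoffs), the sink equilibrium is instead a larger strongly connected component that the walk cycles within.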