Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibria, and its performance in two negotiation games: Colored Trails and Deal or No Deal. We conduct behavioral studies in which human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies at both training time and test time, enables online Bayesian co-player prediction, and can produce agents that, when negotiating with humans, achieve social welfare comparable to that of humans trading among themselves.
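To make the PSRO structure referenced above concrete, the sketch below outlines its standard outer loop: each player trains an approximate best response against the other players' current meta-strategy mixtures, the new policies are added to the populations, and a meta-strategy solver is re-run on the empirical payoff tensor. This is a minimal, generic illustration, not the paper's implementation; the names `oracle`, `meta_solver`, and `evaluate` are hypothetical placeholders for the reinforcement-learning best-response oracle, the meta-strategy solver (e.g., a Nash or Nash-bargaining solver), and the game simulator.

```python
from itertools import product
from typing import Callable, List
import numpy as np


def estimate_payoff_tensor(populations: List[list], evaluate: Callable) -> np.ndarray:
    """Empirical payoff tensor with one entry per joint choice of population policies."""
    n_players = len(populations)
    shape = tuple(len(pop) for pop in populations)
    payoffs = np.zeros((n_players,) + shape)
    for idx in product(*(range(s) for s in shape)):
        joint = [populations[p][idx[p]] for p in range(n_players)]
        # `evaluate` is assumed to return one (estimated) payoff per player.
        payoffs[(slice(None),) + idx] = evaluate(joint)
    return payoffs


def psro(
    n_players: int,
    initial_policy,
    oracle: Callable,       # oracle(player, populations, meta_strategies) -> new policy
    meta_solver: Callable,  # meta_solver(payoff_tensor) -> list of mixed strategies
    evaluate: Callable,     # evaluate(joint_policies) -> payoff vector
    iterations: int = 10,
):
    # One population of policies per player, seeded with a shared initial policy.
    populations: List[list] = [[initial_policy] for _ in range(n_players)]
    meta_strategies = [np.ones(1) for _ in range(n_players)]  # start on the single seed policy

    for _ in range(iterations):
        # 1. Best-response step: each player trains an approximate best response
        #    to the other players' current meta-strategy mixtures.
        new_policies = [oracle(p, populations, meta_strategies) for p in range(n_players)]
        for p, pi in enumerate(new_policies):
            populations[p].append(pi)

        # 2. Meta-game step: rebuild the empirical payoff tensor and recompute
        #    the meta-strategies with the chosen meta-strategy solver.
        payoff_tensor = estimate_payoff_tensor(populations, evaluate)
        meta_strategies = meta_solver(payoff_tensor)

    return populations, meta_strategies
```

The paper's contributions sit inside this loop: the generative search procedure strengthens the best-response step (and test-time play), while the Nash-bargaining-based solvers replace the meta-strategy solver.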