We study the problem of training a principal in a multi-agent general-sum game using reinforcement learning (RL). Learning a robust principal policy requires anticipating the worst possible strategic responses of other agents, which is generally NP-hard. However, we show that no-regret dynamics can identify these worst-case responses in polynomial time in smooth games. We propose a framework that uses this policy-evaluation method to efficiently learn a robust principal policy with RL. The framework also extends to provide robustness against boundedly rational agents. Our motivating application is automated mechanism design: we empirically demonstrate that our framework learns robust mechanisms in both matrix games and complex spatiotemporal games. In particular, we learn a dynamic tax policy that improves the welfare of a simulated trade-and-barter economy by 15%, even when facing previously unseen boundedly rational RL taxpayers.
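To make the evaluation step concrete, below is a minimal sketch, not the paper's implementation, of using no-regret dynamics to score a fixed principal policy: two agents run Hedge (multiplicative-weights) updates on the matrix game the principal induces, and the principal is evaluated under their time-averaged play. The payoff matrices, learning rate, and all names here are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: with the principal's policy fixed, the agents face a
# general-sum matrix game. Running no-regret (Hedge) dynamics drives their
# time-averaged play toward a coarse correlated equilibrium, under which we
# score the principal. All payoffs below are randomly generated placeholders.

rng = np.random.default_rng(0)
n_a, n_b = 4, 4                      # action counts for two agents
U_a = rng.uniform(size=(n_a, n_b))   # agent A's payoffs (principal fixed)
U_b = rng.uniform(size=(n_a, n_b))   # agent B's payoffs
U_p = rng.uniform(size=(n_a, n_b))   # principal's payoff for each action profile

def hedge_dynamics(U_a, U_b, steps=5000, lr=0.05):
    """Run simultaneous Hedge updates; return the time-averaged joint play."""
    n_a, n_b = U_a.shape
    w_a, w_b = np.ones(n_a), np.ones(n_b)
    avg = np.zeros((n_a, n_b))
    for _ in range(steps):
        p = w_a / w_a.sum()
        q = w_b / w_b.sum()
        avg += np.outer(p, q)
        # Each agent exponentially reweights its actions by the expected
        # payoff against the other agent's current mixed strategy.
        w_a *= np.exp(lr * (U_a @ q))
        w_b *= np.exp(lr * (U_b.T @ p))
    return avg / steps

joint = hedge_dynamics(U_a, U_b)
print("principal's value under no-regret play:", (joint * U_p).sum())
```

The smoothness condition in the abstract is what licenses reading this average as a worst-case evaluation: in smooth games, the welfare of any coarse correlated equilibrium reached by no-regret play is bounded relative to the optimum.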