We introduce a reinforcement learning framework for economic design problems. We model the interaction between the designer of the economic environment and the participants as a Stackelberg game: the designer (leader) sets up the rules, and the participants (followers) respond strategically. The followers are modeled via no-regret dynamics, which converge to a Bayesian Coarse-Correlated Equilibrium (B-CCE) of the game induced by the leader. We embed the followers' no-regret dynamics in the leader's learning environment, which allows us to formulate our learning problem as a POMDP. We call this POMDP the Stackelberg POMDP. We prove that the optimal policy of the Stackelberg POMDP achieves the same utility as the leader's optimal strategy in our Stackelberg game. We solve the Stackelberg POMDP using an actor-critic method, where the critic can access the joint information of all agents. Finally, we show that we can learn optimal leader strategies in a variety of settings, including scenarios where the leader participates in or designs normal-form games, as well as settings with incomplete information that capture common aspects of indirect mechanism design, such as limited communication and turn-taking play by agents.
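To make the construction concrete, the following is a minimal, illustrative sketch of the inner loop that the Stackelberg POMDP embeds: the leader commits to a strategy, a single follower responds with multiplicative-weights updates (one standard no-regret algorithm), and the leader's reward is evaluated against the follower's time-averaged play. The 2x2 payoff matrices, function names, and hyperparameters are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative 2-action bimatrix game: rows = leader actions, cols = follower actions.
# These payoffs are assumptions for the sketch, not taken from the paper.
U_LEADER = np.array([[1.0, 3.0], [2.0, 1.0]])
U_FOLLOWER = np.array([[1.0, 0.0], [0.0, 2.0]])

def follower_no_regret_response(leader_mix, rounds=500, eta=0.1):
    """Follower runs multiplicative weights against the committed leader mixture.

    Returns the follower's time-averaged play, whose empirical distribution
    is the kind of (coarse-correlated) equilibrium object the framework relies on.
    """
    weights = np.ones(2)
    avg_play = np.zeros(2)
    for _ in range(rounds):
        mix = weights / weights.sum()
        avg_play += mix
        # Expected payoff of each follower action given the leader's commitment.
        payoffs = leader_mix @ U_FOLLOWER
        weights *= np.exp(eta * payoffs)
    return avg_play / rounds

def leader_reward(leader_mix):
    """Leader's utility once the follower's no-regret play has settled."""
    follower_mix = follower_no_regret_response(leader_mix)
    return leader_mix @ U_LEADER @ follower_mix

# A policy-gradient (e.g. actor-critic) outer loop would adjust the leader's
# policy to maximize this reward; here we just evaluate a few commitments.
for p in (0.0, 0.5, 1.0):
    mix = np.array([p, 1.0 - p])
    print(f"leader mix {mix} -> reward {leader_reward(mix):.3f}")
```

In the full framework, this inner loop runs inside the leader's environment, so that the leader's observations and end-of-episode reward reflect the followers' converged responses; the sketch above collapses that to a single evaluation for clarity.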