Real-world applications such as economics and policy making often involve solving multi-agent games with two distinctive features: (1) the agents are inherently asymmetric and partitioned into leaders and followers; and (2) the agents have different reward functions, making the game general-sum. The majority of existing results in this field focus on either symmetric solution concepts (e.g., the Nash equilibrium) or zero-sum games. It remains largely open how to efficiently learn the Stackelberg equilibrium, an asymmetric analog of the Nash equilibrium, in general-sum games from samples. This paper initiates the theoretical study of sample-efficient learning of the Stackelberg equilibrium in the bandit feedback setting, where we observe only noisy samples of the rewards. We consider three representative two-player general-sum games: bandit games, bandit-reinforcement learning (bandit-RL) games, and linear bandit games. In all of these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimate from finitely many noisy samples, a gap that cannot be closed information-theoretically regardless of the algorithm. We then establish sharp positive results on sample-efficient learning of the Stackelberg equilibrium, with value optimal up to the gap identified above and with matching lower bounds in the dependence on the gap, the error tolerance, and the sizes of the action spaces. Overall, our results reveal unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope will shed light on future research on this topic.
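To make the solution concept concrete, the following is a minimal formalization of the Stackelberg equilibrium in a two-player bandit game; the notation ($\mathcal{A}$, $\mathcal{B}$, $r_1$, $r_2$) is illustrative and not necessarily the paper's own. The leader commits to an action $a \in \mathcal{A}$, the follower best-responds, and the leader optimizes anticipating that response:
\[
\mathrm{br}(a) \in \operatorname*{arg\,max}_{b \in \mathcal{B}} r_2(a, b),
\qquad
a^\star \in \operatorname*{arg\,max}_{a \in \mathcal{A}} r_1\bigl(a, \mathrm{br}(a)\bigr),
\qquad
V^\star = r_1\bigl(a^\star, \mathrm{br}(a^\star)\bigr),
\]
where $r_1$ and $r_2$ are the expected rewards of the leader and the follower, and $V^\star$ is the leader's Stackelberg value. Intuitively, the gap described above arises because finitely many noisy samples resolve the follower's best response only up to statistical error: two follower actions with nearly identical rewards $r_2(a,\cdot)$ may induce very different leader rewards $r_1(a,\cdot)$, so no algorithm can estimate $V^\star$ beyond this resolution.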