普通运动会Stackelberg平衡的抽样有效学习 (Sample-Efficient Learning of Stackelberg Equilibria in General-Sum Games)

Real world applications such as economics and policy making often involve solving multi-agent games with two unique features: (1) The agents are inherently asymmetric and partitioned into leaders and followers; (2) The agents have different reward functions, thus the game is general-sum. The majority of existing results in this field focuses on either symmetric solution concepts (e.g. Nash equilibrium) or zero-sum games. It remains open how to learn the Stackelberg equilibrium -- an asymmetric analog of the Nash equilibrium -- in general-sum games efficiently from noisy samples. This paper initiates the theoretical study of sample-efficient learning of the Stackelberg equilibrium, in the bandit feedback setting where we only observe noisy samples of the reward. We consider three representative two-player general-sum games: bandit games, bandit-reinforcement learning (bandit-RL) games, and linear bandit games. In all these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finitely many noisy samples, which can not be closed information-theoretically regardless of the algorithm. We then establish sharp positive results on sample-efficient learning of Stackelberg equilibrium with value optimal up to the gap identified above, with matching lower bounds in the dependency on the gap, error tolerance, and the size of the action spaces. Overall, our results unveil unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope could shed light on future research on this topic.

翻译：经济学和政策制定等现实世界应用往往涉及解决具有两个独特特点的多试剂游戏,如经济学和政策制定等: (1) 代理人本质上是不对称的,分成领导者和追随者; (2) 代理人有不同的奖励功能,因此游戏是一般和。本领域现有的多数结果侧重于对称解决方案概念(如纳什均衡)或零和游戏。它仍然开放,如何学习斯塔克勒贝格平衡 -- -- 与纳什平衡的不对称类比 -- -- 普通游戏中,从吵闹的样本中有效学习纳什平衡。本文在强盗反馈环境中,开始理论研究Stackelberg平衡的抽样效率学习,我们只观察悬浮的奖励样本。我们考虑三种有代表性的双人总和游戏: 匪帮游戏、土匪拉特加拉特- 硬带学习( 黑带- RL) 游戏和线形强盗版游戏。在所有这些游戏中,我们发现斯塔克克尔堡平衡的准确价值与其估计版本之间的根本差距。本文用很多的杂度样本, 我们无法关闭信息- ―― ―― ――不管是什么算。