We study Stackelberg games where a principal repeatedly interacts with a long-lived, non-myopic agent, without knowing the agent's payoff function. Although learning in Stackelberg games is well understood when the agent is myopic, non-myopic agents pose additional complications. In particular, non-myopic agents may strategically select actions that are inferior in the present to mislead the principal's learning algorithm and obtain better outcomes in the future. We provide a general framework that reduces learning in the presence of non-myopic agents to robust bandit optimization in the presence of myopic agents. Through the design and analysis of minimally reactive bandit algorithms, our reduction trades off the statistical efficiency of the principal's learning algorithm against its effectiveness in inducing near-best-responses. We apply this framework to Stackelberg security games (SSGs), pricing with an unknown demand curve, strategic classification, and general finite Stackelberg games. In each setting, we characterize the type and impact of misspecifications present in near-best-responses and develop a learning algorithm robust to such misspecifications. Along the way, we improve the query complexity of learning in SSGs with $n$ targets from the state-of-the-art $O(n^3)$ to a near-optimal $\widetilde{O}(n)$ by uncovering a fundamental structural property of such games. This result is of independent interest beyond learning with non-myopic agents.