In a Stackelberg game, a leader commits to a randomized strategy, and a follower chooses a best response to it. We consider an extension of the standard Stackelberg game, called a discrete-time dynamic Stackelberg game, with an underlying state space that affects the leader's rewards and available strategies and evolves in a Markovian manner depending on both the leader's and the follower's selected strategies. Although standard Stackelberg games have been used to improve scheduling in security domains, their deployment is often limited by the requirement of complete information about the follower's utility function. In contrast, we consider scenarios in which the follower's utility function is unknown to the leader but can be linearly parameterized. Our objective is to provide an algorithm that prescribes a randomized strategy to the leader at each step of the game based on observations of how the follower responded in previous steps. We design a no-regret learning algorithm that, with high probability, achieves a regret bound (compared to the best policy in hindsight) that is sublinear in the number of time steps; the degree of sublinearity depends on the number of features representing the follower's utility function. The regret of the proposed learning algorithm is independent of the size of the state space and polynomial in the remaining parameters of the game. We show that the proposed learning algorithm outperforms existing model-free reinforcement learning approaches.
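To make the interaction protocol concrete, the following is a minimal sketch (not the paper's algorithm) of one round-by-round play of such a game, under illustrative assumptions: finite state and action spaces, a follower utility of the linearly parameterized form θᵀφ(s, a_L, a_F) with θ unknown to the leader, and a leader that plays an exploratory randomized strategy while recording the linear constraints on θ that each observed best response reveals. All problem sizes, names, and the uniform exploration rule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem sizes (hypothetical; not taken from the paper).
n_states, n_leader, n_follower, n_features = 4, 3, 3, 5

# Feature map phi(s, a_L, a_F) and the true follower parameter theta,
# which is unknown to the leader.
phi = rng.normal(size=(n_states, n_leader, n_follower, n_features))
theta_true = rng.normal(size=n_features)

# Markovian transition kernel P(s' | s, a_L, a_F).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_leader, n_follower))

def follower_best_response(s, x, theta):
    """Follower best-responds to the leader's committed mixed strategy x."""
    # Expected utility of each follower action under the leader's randomization:
    # u(f) = sum_l x[l] * theta . phi(s, l, f).
    u = np.einsum("l,lfk,k->f", x, phi[s], theta)
    return int(np.argmax(u))

constraints = []  # linear inequalities on theta revealed by observed responses
s = 0
for t in range(100):
    # Leader commits to a randomized strategy (uniform exploration here; a
    # learning algorithm would instead choose x_t from what it has inferred).
    x = rng.dirichlet(np.ones(n_leader))

    # Follower observes x and best-responds w.r.t. its true, hidden utility.
    a_F = follower_best_response(s, x, theta_true)
    a_L = rng.choice(n_leader, p=x)

    # Each observed best response reveals linear constraints on theta:
    # theta . (E_x[phi(s, ., a_F)] - E_x[phi(s, ., f)]) >= 0 for every other f.
    feat_best = np.einsum("l,lk->k", x, phi[s, :, a_F])
    for f in range(n_follower):
        if f != a_F:
            feat_alt = np.einsum("l,lk->k", x, phi[s, :, f])
            constraints.append(feat_best - feat_alt)

    # The state evolves in a Markovian way given both players' actions.
    s = rng.choice(n_states, p=P[s, a_L, a_F])

# Sanity check: the true theta satisfies every revealed constraint.
C = np.array(constraints)
assert np.all(C @ theta_true >= -1e-9)
```

The sketch only illustrates the information structure of the problem: the leader never sees θ directly, only best responses to its committed randomizations, and each such response narrows the set of parameters consistent with the follower's behavior.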