关于领导-跟随者随机博弈的动态规划理论 (On Dynamic Programming Theory for Leader-Follower Stochastic Games)

Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment, where a leader commits to a policy and a follower best responds, yielding a strong Stackelberg equilibrium (SSE) with leader-favourable tie-breaking. This paper introduces a dynamic programming (DP) framework that applies Bellman recursion over credible sets-state abstractions formally representing all rational follower best responses under partial leader commitments-to compute SSEs. We first prove that any LF-GSSG admits a lossless reduction to a Markov decision process (MDP) over credible sets. We further establish that synthesising an optimal memoryless deterministic leader policy is NP-hard, motivating the development of ε-optimal DP algorithms with provable guarantees on leader exploitability. Experiments on standard mixed-motive benchmarks-including security games, resource allocation, and adversarial planning-demonstrate empirical gains in leader value and runtime scalability over state-of-the-art methods.

翻译：领导-跟随者一般和随机博弈（LF-GSSGs）建模了非对称承诺下的序贯决策过程，其中领导者承诺一项策略，跟随者以最优响应，从而产生一个具有领导者有利平局处理的强斯塔克尔伯格均衡（SSE）。本文引入了一个动态规划（DP）框架，该框架在可信集（一种状态抽象，形式化地表示在部分领导者承诺下所有理性跟随者最优响应）上应用贝尔曼递归来计算SSE。我们首先证明任何LF-GSSG都可以无损地约简为一个在可信集上的马尔可夫决策过程（MDP）。我们进一步确立了合成一个最优的无记忆确定性领导者策略是NP难的，这推动了开发具有可证明领导者可剥削性保证的ε最优DP算法。在标准混合动机基准测试（包括安全博弈、资源分配和对抗规划）上的实验表明，相较于最先进方法，在领导者价值和运行时可扩展性方面取得了经验性提升。