State-of-the-art methods for solving 2-player zero-sum imperfect-information games rely on linear programming or regret minimization rather than on dynamic programming (DP) or heuristic search (HS), even though the latter are often at the core of state-of-the-art solvers for other sequential decision-making problems. In partially observable or collaborative settings (e.g., POMDPs and Dec-POMDPs), DP and HS require introducing an appropriate statistic that induces a fully observable problem, as well as bounding (convex) approximators of the optimal value function. This approach has succeeded in some subclasses of 2-player zero-sum partially observable stochastic games (zs-POSGs) as well, but how to apply it in the general case remains an open question. We answer it by (i) rigorously defining an equivalent game to work with, (ii) proving mathematical properties of the optimal value function that allow deriving bounds that come with solution strategies, (iii) proposing, for the first time, an HSVI-like solver that provably converges to an $\epsilon$-optimal solution in finite time, and (iv) empirically analyzing it. This opens the door to a novel family of promising approaches complementing those relying on linear programming or iterative methods.