A classic solution technique for Markov decision processes (MDP) and stochastic games (SG) is value iteration (VI). Due to its good practical performance, this approximate approach is typically preferred over exact techniques, even though until recently no practical bounds on the imprecision of its results could be given. As a consequence, even the most widely used model checkers could return arbitrarily wrong results. Over the past decade, different works derived stopping criteria, indicating when the precision reaches the desired level, for various settings, in particular MDP with reachability, total reward, and mean payoff, and SG with reachability. In this paper, we provide the first stopping criteria for VI on SG with total reward and mean payoff, yielding the first anytime algorithms in these settings. To this end, we provide the solution in two flavours: first through a reduction to the MDP case, and second directly on SG. The former is simpler and automatically benefits from any advances on MDP. The latter allows for more local computations, paving the way towards better practical efficiency. Our solution unifies the previously mentioned approaches for MDP and SG and their underlying ideas. To achieve this, we isolate objective-specific subroutines as well as identify objective-independent concepts. These structural concepts, while surprisingly simple, form the very essence of the unified solution.
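To illustrate the kind of stopping criterion discussed above, the following minimal Python sketch performs interval (bounded) value iteration for maximal reachability on a toy MDP: a lower and an upper bound on the value are iterated, and the computation stops once they differ by less than a chosen precision. The MDP, the state names, and the constant EPSILON are hypothetical examples, not taken from the paper, and the sketch assumes the only end components are the absorbing target and sink states, so no end-component collapsing is needed.

```python
EPSILON = 1e-6  # desired precision (hypothetical choice)

# Hypothetical toy MDP: mdp[state] = list of actions,
# each action = list of (probability, successor) pairs.
mdp = {
    "s0": [[(0.6, "target"), (0.4, "sink")],
           [(0.5, "s1"), (0.5, "sink")]],
    "s1": [[(0.9, "target"), (0.1, "sink")]],
}
TARGET, SINK = "target", "sink"
states = list(mdp) + [TARGET, SINK]

def bellman(values):
    """One Bellman update for maximal reachability; target/sink stay fixed."""
    new = dict(values)
    for s, actions in mdp.items():
        new[s] = max(sum(p * values[t] for p, t in a) for a in actions)
    return new

# Interval value iteration: lower bound L and upper bound U.
L = {s: 0.0 for s in states}; L[TARGET] = 1.0
U = {s: 1.0 for s in states}; U[SINK] = 0.0

iterations = 0
while max(U[s] - L[s] for s in states) >= EPSILON:  # stopping criterion
    L, U = bellman(L), bellman(U)
    iterations += 1

print(f"value(s0) in [{L['s0']:.7f}, {U['s0']:.7f}] after {iterations} iterations")
```

Because both bounds converge to the true value under the stated assumption, the loop can be interrupted at any time and still reports a sound interval, which is exactly the anytime behaviour the stopping criteria in the paper aim for (there in the more general SG setting with total reward and mean payoff).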