Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. Via probabilities in the transition function, MDPs capture the stochasticity that may arise, for instance, from imprecise actuators. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, such as safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data seen so far. In an experimental evaluation on several benchmarks, we show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm.
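To make the interval idea concrete, the following is a minimal sketch of how observed transition counts could be turned into a probability interval for a uMDP. It assumes a Beta posterior over a single transition probability and a hypothetical 95% credible level; scipy, the uniform prior, and all parameter choices are assumptions for this illustration and do not reflect the paper's actual inference scheme.

```python
# Illustrative sketch only: derive an interval estimate for a single
# transition probability P(s' | s, a) from observed counts, using a Beta
# posterior credible interval. The prior (1, 1) and the 0.95 credible
# level are hypothetical choices for this example.
from scipy.stats import beta

def transition_interval(count_to_target, count_total,
                        prior_a=1.0, prior_b=1.0, level=0.95):
    """Interval for P(s' | s, a) after observing `count_to_target` transitions
    to s' out of `count_total` samples of the state-action pair (s, a)."""
    a = prior_a + count_to_target
    b = prior_b + (count_total - count_to_target)
    lower = beta.ppf((1.0 - level) / 2.0, a, b)
    upper = beta.ppf(1.0 - (1.0 - level) / 2.0, a, b)
    return lower, upper

# With few samples the interval is wide; as more data arrive it tightens
# around the empirical frequency.
print(transition_interval(7, 20))      # wide interval around 7/20
print(transition_interval(700, 2000))  # much tighter interval around 0.35
```

In such a setting, the resulting per-transition intervals would serve as the uncertainty sets of the uMDP, on which a robust policy is computed against the worst-case probabilities within each interval.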