Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. Furthermore, our method is capable of adapting to changes in the environment. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
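To make point (1) concrete, the following is a minimal sketch of how probability intervals for a uMDP could be derived from transition counts under a Dirichlet-multinomial model with equal-tailed credible intervals. This is an illustrative assumption, not the paper's actual inference scheme; the function name, prior, and credibility level are hypothetical.

```python
# Sketch (assumed, not the paper's exact scheme): build per-successor
# probability intervals from observed transition counts via the marginal
# Beta distributions of a symmetric Dirichlet posterior.
from scipy.stats import beta

def interval_from_counts(counts, prior=1.0, credibility=0.95):
    """Return a (lower, upper) probability interval per successor state.

    counts: observed transition counts for one (state, action) pair.
    prior:  symmetric Dirichlet pseudo-count (illustrative choice).
    """
    total = sum(counts) + prior * len(counts)
    lo_q, hi_q = (1 - credibility) / 2, 1 - (1 - credibility) / 2
    intervals = []
    for c in counts:
        a = c + prior          # Dirichlet marginal for this successor is Beta(a, total - a)
        b = total - a
        intervals.append((beta.ppf(lo_q, a, b), beta.ppf(hi_q, a, b)))
    return intervals

# Example: three successor states observed 8, 1, and 1 times for some (s, a).
print(interval_from_counts([8, 1, 1]))
```

Intervals produced this way shrink as more data arrives, which matches the anytime flavor of the approach: stopping early simply yields wider, more conservative uncertainty sets for the robust policy computation.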