The parameters of a Markov Decision Process (MDP) often cannot be specified exactly. Uncertain MDPs (UMDPs) capture this model ambiguity by defining sets to which the parameters belong. Minimax regret has been proposed as an objective for planning in UMDPs to find policies that are robust yet not overly conservative. In this work, we focus on planning for Stochastic Shortest Path (SSP) UMDPs with uncertain cost and transition functions. We introduce a Bellman equation to compute the regret of a policy. We propose a dynamic programming algorithm that utilises this regret Bellman equation, and show that it optimises minimax regret exactly for UMDPs with independent uncertainties. For coupled uncertainties, we extend our approach to use options, enabling a trade-off between computation and solution quality. We evaluate our approach on both synthetic and real-world domains, showing that it significantly outperforms existing baselines.
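As a point of reference, the minimax regret objective for a UMDP can be written as below; this is a minimal sketch in standard notation, where the symbols $\Pi$, $\Xi$, $V^{\pi}_{\xi}$, and $s_0$ are assumptions of this sketch rather than notation taken from the paper:
\[
  \min_{\pi \in \Pi} \; \max_{\xi \in \Xi} \Bigl( V^{\pi}_{\xi}(s_0) \;-\; \min_{\pi' \in \Pi} V^{\pi'}_{\xi}(s_0) \Bigr),
\]
where $\Xi$ is the uncertainty set over cost and transition parameters, and $V^{\pi}_{\xi}(s_0)$ denotes the expected cost-to-go of policy $\pi$ from the initial state $s_0$ under model $\xi$. The inner difference is the regret of $\pi$ under $\xi$: the gap between the cost of $\pi$ and the cost of the best policy in hindsight for that model.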