We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers distributionally robust MDPs and distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By assuming that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive a tractable reformulation of our model. In particular, we show that the return-risk model can also account for risk from an uncertain transition kernel when only deterministic policies are sought, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. We design a scalable first-order algorithm for large-scale problems and demonstrate the advantages of the proposed model and algorithm through numerical experiments.
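For concreteness, a minimal sketch of the return-risk criterion described above, under assumed notation (the weight $\lambda \in [0,1]$, risk level $\alpha \in (0,1)$, and Wasserstein radius $\varepsilon$ are illustrative symbols introduced here, not taken from the paper):
\[
\max_{\pi \in \Pi} \; \inf_{\mathbb{P} \in \mathcal{B}_{\varepsilon}(\widehat{\mathbb{P}})} \left\{ \lambda \, \mathbb{E}_{\mathbb{P}}\!\left[ R^{\pi} \right] + (1 - \lambda) \, \mathrm{VaR}_{\alpha}^{\mathbb{P}}\!\left[ R^{\pi} \right] \right\},
\]
where $R^{\pi}$ is the cumulative reward under policy $\pi$, $\mathcal{B}_{\varepsilon}(\widehat{\mathbb{P}})$ is a Wasserstein ball of radius $\varepsilon$ centered at a nominal reward distribution $\widehat{\mathbb{P}}$, and $\mathrm{VaR}_{\alpha}^{\mathbb{P}}$ denotes the $\alpha$-percentile (value-at-risk) under $\mathbb{P}$. In this sketch, $\lambda = 1$ recovers a distributionally robust MDP and $\lambda = 0$ recovers the distributionally robust percentile (chance-constrained) criterion, consistent with the two special cases noted above.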