We consider the problem of designing policies for Markov decision processes (MDPs) with dynamic coherent risk objectives and constraints. We begin by formulating the problem in a Lagrangian framework. Under the assumption that the risk objectives and constraints can be represented by a Markov risk transition mapping, we propose an optimization-based method to synthesize Markovian policies that lower-bound the constrained risk-averse problem. We demonstrate that the formulated optimization problems are in the form of difference convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) framework. We show that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints. Finally, we illustrate the effectiveness of the proposed method with numerical experiments on a rover navigation problem involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) coherent risk measures.
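As a brief illustration of the solution machinery mentioned above (and not of the policy-synthesis program derived in the paper), the sketch below shows how a small difference-of-convex problem can be posed in the open-source CVXPY modeling language and handed to the dccp Python package, which implements the disciplined convex-concave programming framework. The toy objective and constraints are placeholders chosen only to exercise the interface.

```python
# Minimal sketch of the DCCP interface; a generic toy DC program,
# not the risk-averse policy-synthesis optimization from the paper.
import cvxpy as cp
import dccp  # registers the "dccp" solve method with CVXPY

x = cp.Variable(2)
y = cp.Variable(2)

# Maximizing a convex function (a norm) over a box is not a convex
# program, but it is a valid difference-of-convex (DC) program.
prob = cp.Problem(cp.Maximize(cp.norm(x - y, 2)),
                  [0 <= x, x <= 1, 0 <= y, y <= 1])

print("Is DCP (convex):", prob.is_dcp())   # False
print("Is DCCP:", dccp.is_dccp(prob))      # True

# The convex-concave procedure returns a locally optimal solution.
prob.solve(method="dccp")
print("x =", x.value, "y =", y.value)
```

In the same spirit, the DCPs arising from the risk-averse formulation can be expressed in this modeling layer and solved by successive convexification, at the cost of obtaining local rather than global optima.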