A large class of decision-making problems under uncertainty can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs), with applications to artificial intelligence and operations research, among others. Traditionally, policy synthesis techniques are proposed such that a total expected cost or reward is minimized or maximized. However, optimality in the total expected cost sense is only reasonable if system behavior over a large number of runs is of interest, which has limited the use of such policies in practical mission-critical scenarios, wherein large deviations from the expected behavior may lead to mission failure. In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints given in terms of dynamic coherent risk measures, which we refer to as the constrained risk-averse problem. For MDPs, we reformulate the problem into an inf-sup problem via the Lagrangian framework and propose an optimization-based method to synthesize Markovian policies. We demonstrate that the resulting optimization problems are in the form of difference of convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) framework. We show that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints. For POMDPs, we show that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies. For stochastic finite-state controllers (FSCs), we show that the latter optimization simplifies to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs into a policy iteration algorithm to design risk-averse FSCs for POMDPs.
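To make the baseline concrete, the following is a minimal sketch (not the paper's risk-averse formulation) of the classical occupation-measure linear program for a constrained MDP with total discounted expected cost and a single expected-cost constraint, which the DCP formulation above is said to generalize. It uses CVXPY; all data (`P`, `c`, `d`, `mu0`, `beta`) are hypothetical placeholders. For the risk-averse variants, the analogous program is no longer an LP but a DCP, and the dccp package (`import dccp`, then `prob.solve(method="dccp")`) can be used to solve it under the DCCP framework.

```python
# Sketch under assumed data: classical constrained-MDP occupation-measure LP.
import numpy as np
import cvxpy as cp

nS, nA, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition kernel
c = rng.random((nS, nA))                        # objective cost c(s, a)
d = rng.random((nS, nA))                        # constraint cost d(s, a)
mu0 = np.ones(nS) / nS                          # initial state distribution
beta = 5.0                                      # constraint budget (assumed)

# Discounted occupation measure x(s, a) = sum_t gamma^t Pr(s_t = s, a_t = a).
x = cp.Variable((nS, nA), nonneg=True)

# Flow (Bellman-flow) constraints defining a valid occupation measure.
flow = [cp.sum(x[sp, :]) ==
        mu0[sp] + gamma * cp.sum(cp.multiply(P[:, :, sp], x))
        for sp in range(nS)]
# Expected-cost constraint on the secondary cost d.
budget = [cp.sum(cp.multiply(d, x)) <= beta]

prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(c, x))), flow + budget)
prob.solve()

# Recover a randomized Markovian policy from the occupation measure.
policy = x.value / x.value.sum(axis=1, keepdims=True)
print(prob.value)
print(policy)
```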