Interval Markov Decision Processes (IMDPs) are finite-state uncertain Markov models in which the transition probabilities belong to intervals. Recently, there has been a surge of research on employing IMDPs as abstractions of stochastic systems for control synthesis. However, due to the absence of synthesis algorithms for IMDPs with continuous action spaces, the action space is assumed to be discrete a priori, which is a restrictive assumption for many applications. Motivated by this, we introduce continuous-action IMDPs (caIMDPs), where the bounds on the transition probabilities are functions of the action variables, and study value iteration for maximizing expected cumulative rewards. Specifically, we decompose the max-min problem associated with value iteration into $|\mathcal{Q}|$ max problems, where $|\mathcal{Q}|$ is the number of states of the caIMDP. Then, exploiting the simple form of these max problems, we identify cases where value iteration over caIMDPs can be solved efficiently (e.g., via linear or convex programming). We also obtain other interesting insights: e.g., in certain cases where the action set $\mathcal{A}$ is a polytope, synthesis over a discrete-action IMDP whose actions are the vertices of $\mathcal{A}$ is sufficient for optimality. We demonstrate our results on a numerical example. Finally, we include a short discussion on employing caIMDPs as abstractions for control synthesis.
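To make the max-min problem concrete, the display below is a minimal sketch of the robust Bellman recursion that value iteration applies state by state. It assumes a standard IMDP-style setup with a reward function $R$ and action-dependent interval bounds $\underline{P}$, $\overline{P}$; these symbols, and the set $\Delta(\mathcal{Q})$ denoting the probability simplex over $\mathcal{Q}$, are illustrative and not fixed by the abstract, so the exact formulation in the paper may differ:
$$
V_{k+1}(q) \;=\; \max_{a \in \mathcal{A}} \; \min_{\gamma \in \Gamma_{q,a}} \Big( R(q,a) + \sum_{q' \in \mathcal{Q}} \gamma(q')\, V_k(q') \Big), \qquad
\Gamma_{q,a} = \Big\{ \gamma \in \Delta(\mathcal{Q}) : \underline{P}(q' \mid q,a) \le \gamma(q') \le \overline{P}(q' \mid q,a) \ \ \forall q' \Big\},
$$
with one such problem per state, i.e., $|\mathcal{Q}|$ problems per iteration. Eliminating the inner min for fixed $V_k$ (the paper details how) then leaves the $|\mathcal{Q}|$ pure max problems over $a \in \mathcal{A}$ referred to above.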