Preferences play a key role in determining which goals/constraints to satisfy when not all of them can be satisfied simultaneously. In this paper, we study how to synthesize preference-satisfying plans in stochastic systems, modeled as Markov decision processes (MDPs), given a (possibly incomplete) combinative preference model over temporally extended goals. We start by introducing new semantics to interpret preferences over infinite plays of the stochastic system. Then, we introduce a new notion of improvement to enable comparison between two prefixes of an infinite play. Based on this, we define two solution concepts, safe and positively improving (SPI) and safe and almost-surely improving (SASI), which enforce improvements with positive probability and with probability one, respectively. We construct a model called an improvement MDP, in which the synthesis of SPI and SASI strategies that guarantee at least one improvement reduces to computing positive and almost-sure winning strategies in an MDP. We present an algorithm to synthesize SPI and SASI strategies that induce multiple sequential improvements. We demonstrate the proposed approach using a robot motion planning problem.
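As context for the reduction mentioned above, the following is a minimal sketch (not taken from the paper) of the standard graph-based computations of positive and almost-sure winning regions for a reachability objective in an MDP. The representation `trans`, mapping a state-action pair to the set of possible successor states, and the function names are assumptions for illustration; the paper's improvement MDP would supply the concrete state space and target set.

```python
from collections import defaultdict

def positive_reach(states, trans, target):
    """States that can reach `target` with positive probability:
    backward reachability over the transition-support graph."""
    reach = set(target) & set(states)
    frontier = list(reach)
    pred = defaultdict(set)  # predecessor map restricted to `states`
    for (s, a), succs in trans.items():
        if s not in states:
            continue
        for t in succs:
            pred[t].add(s)
    while frontier:
        t = frontier.pop()
        for s in pred[t]:
            if s not in reach:
                reach.add(s)
                frontier.append(s)
    return reach

def almost_sure_reach(states, trans, target):
    """Almost-sure winning region for reaching `target`: iteratively keep
    only actions whose entire support stays inside the candidate set, then
    keep only states that still reach `target` with positive probability,
    until a fixed point is reached."""
    W = set(states)
    while True:
        allowed = {(s, a): succs for (s, a), succs in trans.items()
                   if s in W and set(succs) <= W}
        C = positive_reach(W, allowed, set(target) & W)
        if C == W:
            return W
        W = C
```

Computing the positive winning region is a single backward reachability pass, while the almost-sure region requires the nested fixed point above; this mirrors the qualitative analysis that SPI and SASI synthesis reduces to on the improvement MDP.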