Decision-making policies for agents are often synthesized with the constraint that a formal specification of behaviour is satisfied. Here we focus on infinite-horizon properties. On the one hand, Linear Temporal Logic (LTL) is a popular formalism for qualitative specifications. On the other hand, Steady-State Policy Synthesis (SSPS) has recently received considerable attention as it provides a more quantitative and more behavioural perspective on specifications, in terms of the frequency with which states are visited. Finally, rewards provide a classic framework for quantitative properties. In this paper, we study Markov decision processes (MDP) with specifications combining all three types. The derived policy maximizes the reward among all policies that satisfy the LTL specification with the given probability and adhere to the steady-state constraints. To this end, we provide a unified solution reducing the multi-type specification to a multi-dimensional long-run average reward. This is enabled by Limit-Deterministic B\"uchi Automata (LDBA), recently studied in the context of LTL model checking on MDP, and allows for an elegant solution through a simple linear programme. The algorithm also extends to general $\omega$-regular properties and runs in time polynomial in the sizes of the MDP as well as the LDBA.
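To give a flavour of the kind of linear programme involved, the classical steady-state formulation for a single recurrent class of an MDP uses variables $x_{s,a}$ for the long-run frequency with which action $a$ is taken in state $s$. The sketch below is a simplified illustration only: it assumes a unichain MDP with lower and upper steady-state bounds $\ell_s, u_s$, and it does not reproduce the paper's full construction on the product with the LDBA (the variable names and the single-chain assumption are ours).
\begin{align*}
\max_{x \ge 0} \quad & \sum_{s \in S} \sum_{a \in A(s)} x_{s,a}\, r(s,a)
  && \text{(long-run average reward)}\\
\text{s.t.} \quad
  & \sum_{a \in A(s)} x_{s,a} \;=\; \sum_{s' \in S} \sum_{a' \in A(s')} x_{s',a'}\, P(s \mid s', a')
  && \forall s \in S \quad \text{(flow balance / stationarity)}\\
  & \sum_{s \in S} \sum_{a \in A(s)} x_{s,a} \;=\; 1
  && \text{(normalisation: $x$ is a distribution)}\\
  & \ell_s \;\le\; \sum_{a \in A(s)} x_{s,a} \;\le\; u_s
  && \forall s \in S \quad \text{(steady-state constraints)}
\end{align*}
In the unified solution described above, a programme of this flavour is set up over the product of the MDP with the LDBA, with additional dimensions of the long-run average reward encoding the acceptance condition and the probability threshold.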