Continuous-time Markov decision processes (CTMDPs) are canonical models for expressing sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice for computing optimal decision sequences. RL, however, requires the learning objective to be encoded as scalar reward signals. Since performing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) into scalar rewards for discrete-time Markov decision processes (MDPs). Unfortunately, no such automatic translation exists for CTMDPs. We consider CTMDP environments against learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the ``good states'' of the automaton. We present an approach enabling a correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
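For concreteness, the two semantics can be sketched in symbols; the notation below ($G$ for the good states of the product of the CTMDP with the objective automaton, $X_t$ for the state occupied at time $t$, $\sigma$ for a strategy) is ours and only indicative of the formal definitions in the body of the paper.
\[
  \text{satisfaction:}\quad \sup_{\sigma}\ \Pr^{\sigma}\!\Big[\liminf_{T\to\infty}\tfrac{1}{T}\int_{0}^{T}\mathbf{1}[X_t\in G]\,dt \;>\; 0\Big],
  \qquad
  \text{expectation:}\quad \sup_{\sigma}\ \mathbb{E}^{\sigma}\!\Big[\liminf_{T\to\infty}\tfrac{1}{T}\int_{0}^{T}\mathbf{1}[X_t\in G]\,dt\Big].
\]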