We consider Markov decision processes where the state of the chain is observed only at chosen observation times and at a cost. Optimal strategies therefore involve optimising the observation times as well as the subsequent action values. We consider the finite-horizon and discounted infinite-horizon problems, as well as an extension with parameter uncertainty. By including the time elapsed since the last observation as part of the augmented Markov system, the value function satisfies a system of quasi-variational inequalities (QVIs). This class of QVIs can be seen as an extension of the interconnected obstacle problem. We prove a comparison principle for this class of QVIs, which implies uniqueness of solutions to our proposed problem. Penalty methods are then utilised to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications which illustrate our framework.