We present a framework for a controlled Markov chain whose state is revealed only at chosen observation times and at a cost. Optimal strategies therefore involve choosing the observation times as well as the subsequent control values. We show that the corresponding value function satisfies a dynamic programming principle, which leads to a system of quasi-variational inequalities (QVIs). Next, we give an extension in which the model parameters are not known a priori but are inferred from the costly observations via Bayesian updates. We then prove a comparison principle for a larger class of QVIs, which implies uniqueness of solutions to our proposed problem. We utilise penalty methods to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications which illustrate our framework.