In many practical settings, control decisions must be made under partial or imperfect information about the evolution of a relevant state variable. Partially Observable Markov Decision Processes (POMDPs) provide a relatively well-developed framework for modeling and analyzing such problems. In this paper, we consider the structural estimation of the primitives of a POMDP model based on the observable history of the process. We analyze the structural properties of a POMDP model with random rewards and specify conditions under which the model is identifiable without knowledge of the state dynamics. We consider a soft policy gradient algorithm to compute a maximum likelihood estimator and provide a finite-time characterization of its convergence to a stationary point. We illustrate the estimation methodology with an application to optimal equipment replacement. In this context, replacement decisions must be made under partial or imperfect information about the true state (i.e., the condition of the equipment). We use synthetic and real data to highlight the robustness of the proposed methodology and to characterize the potential for misspecification when partial state observability is ignored.