We develop a central limit theorem (CLT) for the nonparametric estimator of the transition matrices of controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build on this result to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. As a corollary, we obtain goodness-of-fit tests that allow us to test whether the logged data are stochastic. Together, these results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests on transition probabilities.
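To make the object of study concrete, the following is a minimal illustrative sketch (not from the paper) of the nonparametric transition-matrix estimator in a finite CMC: empirical frequencies P_hat(s' | s, a) = N(s, a, s') / N(s, a) computed from a single trajectory logged under a stochastic policy. The specific kernel `P`, the uniform logging policy, and the trajectory length are all hypothetical choices for the demo; the standardized error of one entry is printed as an informal check of the asymptotic-normality claim.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2  # finite state and action spaces (hypothetical sizes)

# True transition kernel P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))
# A uniform (hence stochastic, exploratory) logging policy.
policy = np.full((S, A), 1.0 / A)

# Log one long trajectory under the logging policy.
T = 200_000
counts = np.zeros((S, A, S))
s = 0
for _ in range(T):
    a = rng.choice(A, p=policy[s])
    s_next = rng.choice(S, p=P[s, a])
    counts[s, a, s_next] += 1
    s = s_next

# Nonparametric (empirical-frequency) estimator of the transition matrices.
N_sa = counts.sum(axis=2, keepdims=True)
P_hat = counts / np.maximum(N_sa, 1)

# Standardized error of one entry; a CLT predicts it is roughly N(0, 1)-sized.
s0, a0, s1 = 0, 0, 1
n = counts[s0, a0].sum()
se = np.sqrt(P[s0, a0, s1] * (1 - P[s0, a0, s1]) / n)
z = (P_hat[s0, a0, s1] - P[s0, a0, s1]) / se
print(f"max |P_hat - P| = {np.abs(P_hat - P).max():.4f}, z = {z:.2f}")
```

With an exploratory logging policy every (s, a) pair is visited often, so the estimation error shrinks at the root-n rate; a deterministic or non-exploratory policy can starve some pairs of data, which is the kind of setting where no CLT holds.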