In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results through various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
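For concreteness, the sketch below illustrates what such a count-based estimator typically looks like, assuming the "natural nonparametric estimator" refers to the usual empirical frequency estimate P̂(s' | s, a) = N(s, a, s') / N(s, a) computed from a single logged trajectory. The function name, the uniform fallback for unvisited state-action pairs, and the synthetic trajectory are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def empirical_transition_estimate(states, actions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from one logged trajectory.

    states  : visited states S_0, ..., S_T      (length T + 1)
    actions : chosen actions A_0, ..., A_{T-1}  (length T)
    Returns an array of shape (n_actions, n_states, n_states); rows for
    state-action pairs never visited default to the uniform distribution
    (an arbitrary convention for this sketch).
    """
    counts = np.zeros((n_actions, n_states, n_states))
    for t in range(len(actions)):
        counts[actions[t], states[t], states[t + 1]] += 1.0

    visits = counts.sum(axis=2, keepdims=True)           # N(s, a)
    p_hat = np.where(visits > 0,
                     counts / np.maximum(visits, 1.0),   # N(s, a, s') / N(s, a)
                     1.0 / n_states)                     # unvisited (s, a) pairs
    return p_hat

# Usage on a synthetic trajectory standing in for data from a logging policy.
rng = np.random.default_rng(0)
states = rng.integers(0, 3, size=101)    # S_0, ..., S_100
actions = rng.integers(0, 2, size=100)   # A_0, ..., A_99
P_hat = empirical_transition_estimate(states, actions, n_states=3, n_actions=2)
print(P_hat.shape, P_hat[0].sum(axis=1))  # each estimated row sums to 1
```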