Estimation of missing mass with the popular Good-Turing (GT) estimator is well-understood in the case where samples are independent and identically distributed (iid). In this article, we consider the same problem when the samples come from a stationary Markov chain with a rank-2 transition matrix, which is one of the simplest extensions of the iid case. We develop an upper bound on the absolute bias of the GT estimator in terms of the spectral gap of the chain and a tail bound on the occupancy of states. Borrowing tail bounds from known concentration results for Markov chains, we evaluate the bound using other parameters of the chain. The analysis, supported by simulations, suggests that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared error falling with number of samples at a rate that depends loosely on the connectivity of the states in the chain.
翻译:与流行的Good-Turing(GT)估计仪估算缺失质量在样本独立且分布相同的情况下是十分清楚的。 在本条中,当样本来自一个固定的Markov链条,具有一级至二级过渡矩阵,这是iid案件最简单的延伸之一。我们开发了一个上限,以GT测算器在链条的光谱差距和州占有权的尾巴方面的绝对偏差为标准。从已知的Markov链条浓度结果中借取尾线,我们使用该链条的其他参数评估其约束。在模拟的支持下的分析表明,对于二级不可移动链条,GT测算器带有偏差和中度误差,与样本数量相依,其速率在很大程度上取决于各州在链条上的连通性。