The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This novel but practically relevant perspective covers situations in which coverage probabilities must be estimated from a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the number of distinct symbols in the true data and the number of distinct symbols with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
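To make the setting concrete, the following minimal Python sketch illustrates the two ingredients mentioned above: sketching data through random hashing, and the missing mass as a target of estimation. It is not the Bayesian nonparametric methodology of this paper. The function names `sketch` and `good_turing_missing_mass`, the seeded SHA-256 hash, and the toy data are all assumptions made for illustration, and the classical Good-Turing estimator is used only as a familiar reference point that requires the unsketched data.

```python
import hashlib
from collections import Counter


def sketch(data, num_buckets, seed=0):
    """Hash each symbol into one of num_buckets buckets; return the bucket counts.

    Illustrative assumption: a seeded SHA-256 digest stands in for a random hash.
    """
    counts = [0] * num_buckets
    for symbol in data:
        digest = hashlib.sha256(f"{seed}:{symbol}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        counts[bucket] += 1
    return counts


def good_turing_missing_mass(data):
    """Classical Good-Turing estimate: (symbols observed exactly once) / n.

    This needs the empirical frequencies, which a sketch does not expose.
    """
    frequencies = Counter(data)
    singletons = sum(1 for c in frequencies.values() if c == 1)
    return singletons / len(data)


# Toy data (hypothetical): 8 observations over 5 distinct symbols.
data = ["a", "b", "a", "c", "d", "a", "b", "e"]

# Only these bucket counts would be available in the sketched setting.
bucket_counts = sketch(data, num_buckets=4)

# Reference value computable only from the full data.
full_data_estimate = good_turing_missing_mass(data)

print("bucket counts:", bucket_counts)
print("Good-Turing missing mass from full data:", full_data_estimate)
```

Because distinct symbols may collide in the same bucket, the bucket counts alone do not reveal how many symbols appeared exactly once, which is why estimation from the sketch calls for a different approach than the full-data estimator.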