Given a lossy-compressed representation, or sketch, of data with values in a set of symbols, the frequency recovery problem considers the estimation of the empirical frequency of a new data point. Recent studies have applied Bayesian nonparametrics (BNPs) to develop learning-augmented versions of the popular count-min sketch (CMS) recovery algorithm. In this paper, we present a novel BNP approach to frequency recovery, which is not built from the CMS but still relies on a sketch obtained by random hashing. Assuming data to be modeled as random samples from an unknown discrete distribution, which is endowed with a Poisson-Kingman (PK) prior, we provide the posterior distribution of the empirical frequency of a symbol, given the sketch. Estimates are then obtained as mean functionals. An application of our result is presented for the Dirichlet process (DP) and Pitman-Yor process (PYP) priors, and in particular: i) we characterize the DP prior as the sole PK prior featuring a property of sufficiency with respect to the sketch, leading to a simple posterior distribution; ii) we identify a large sample regime under which the PYP prior leads to a simple approximation of the posterior distribution. Then, we develop our BNP approach to a "traits" formulation of the frequency recovery problem, not yet studied in the CMS literature, in which data belong to more than one symbol (trait), and exhibit nonnegative integer levels of associations with each trait. In particular, by modeling data as random samples from a generalized Indian buffet process, we provide the posterior distribution of the empirical frequency level of a trait, given the sketch. This result is then applied under the assumption of a Poisson and Bernoulli distribution for the levels of associations, leading to a simple posterior distribution and a simple approximation of the posterior distribution, respectively.
翻译:给定值域为符号集合的数据的损失压缩表示,或称为草图(sketch),频率恢复问题考虑估计新的数据点的经验频率。最近的研究将贝叶斯非参数(BNPs)应用于开发混合频率计数(min-sketch(CMS))恢复算法的学习增强版本。在本文中,我们提出了一种新颖的BNP方法来进行频率恢复,它不是基于CMS构建的,但仍然依赖于通过随机哈希得到的草图。假设数据被建模为来自未知离散分布的随机样本,该分布带有泊松-金曼(PK)先验,我们给出了在草图给定的情况下符号经验频率的后验分布。然后,估计值作为均值函数得到。我们还应用我们的结果于狄利克雷过程(DP)和皮特曼 - 约尔过程(PYP)先验,并特别提出:i)我们将DP先验表征为仅具有关于草图充分性质的PK先验,导致后验分布简单; ii)我们确定了一个大样本区域,在这个区域内,PYP先验导致了对后验分布的简单逼近。然后,我们发展了我们的BNP方法应用于频率恢复问题的"特性"公式,这在CMS文献中尚未研究,其中数据属于多个符号(特性),并展示了每个特性的非负整数级别的关联。特别地,通过将数据建模为来自广义印第安自助选股过程的随机样本,我们提供了在草图给定的情况下特性的经验频率级别的后验分布。然后,在假设关联级别的泊松分布和伯努利分布的情况下,我们分别得到了简单的后验分布和简单的后验分布逼近。