The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that estimates the frequencies of tokens in a data stream, i.e. answers point queries, from randomly hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, was proposed by Cai, Mitzenmacher and Adams (\textit{NeurIPS} 2018). It relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query obtained as suitable mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has been shown to improve on some aspects of the CMS, it has the major drawback of arising from a ``constructive'' proof that builds upon arguments tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a ``Bayesian'' proof of the CMS-DP whose main advantage is that it builds upon arguments that are usable, in principle, within a broad class of nonparametric priors arising from normalized completely random measures. This result leads to the development of a novel learning-augmented CMS under power-law data streams, referred to as CMS-PYP, which relies on BNP modeling of the data stream of tokens via a Pitman-Yor process (PYP) prior. Under this more general framework, we apply the arguments of the ``Bayesian'' proof of the CMS-DP, suitably adapted to the PYP prior, to compute the posterior distribution of a point query, given the hashed data. Applications to synthetic data and real textual data show that the CMS-PYP outperforms the CMS and the CMS-DP in estimating low-frequency tokens, which are known to be of critical interest in textual data, and that it is competitive with a variation of the CMS designed for low-frequency tokens. An extension of our BNP approach to more general queries is also discussed.
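For readers unfamiliar with the baseline, the following is a minimal sketch of the classical CMS only, not of the CMS-DP or CMS-PYP estimators. The class name `CountMinSketch`, the parameters `width` and `depth`, and the use of a salted BLAKE2b digest as the per-row hash function are illustrative assumptions and are not taken from the paper.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch: `depth` hash rows, each with `width` counters."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row: int, token: str) -> int:
        # Assumption: one salted BLAKE2b digest per row stands in for
        # the independent random hash functions of the CMS.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, token: str, count: int = 1) -> None:
        # Each token increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, token)] += count

    def query(self, token: str) -> int:
        # Point query: collisions can only inflate a counter, so the
        # minimum over rows is an upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(row, token)]
            for row in range(self.depth)
        )


# Hypothetical toy stream, used only to exercise the structure.
stream = ["the"] * 50 + ["sketch"] * 5 + ["prior"] * 2 + ["posterior"]
cms = CountMinSketch(width=16, depth=4)
for tok in stream:
    cms.update(tok)
```

Because hash collisions only add to the counters, `cms.query(tok)` never underestimates the true count of `tok`. The resulting overestimation matters most in relative terms for low-frequency tokens, which is the regime the abstract singles out.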