We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log(1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based frequency estimators of [Hsu et al., ICLR'19] as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to the state-of-the-art algorithm.
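To make the idea of predictor-augmented support estimation concrete, the sketch below shows a simple Horvitz-Thompson-style estimator that uses a learned frequency predictor. This is an illustrative construction under our own assumptions, not the algorithm analyzed in the paper: each distinct element seen in an i.i.d. sample is weighted by the inverse of its inclusion probability, computed from the predicted frequency.

```python
import random
from collections import Counter

def estimate_distinct(sample, n, predict_freq):
    """Horvitz-Thompson-style support-size estimate (illustrative sketch).

    sample       : list of elements drawn i.i.d. from the data set
    n            : total data set size
    predict_freq : callable mapping an element to its predicted frequency
                   (e.g., a trained neural predictor; here an assumption)
    """
    s = len(sample)
    estimate = 0.0
    for x in set(sample):  # each distinct sampled element counted once
        f = max(1, predict_freq(x))
        # probability that an element of frequency f appears at least
        # once in an i.i.d. sample of size s
        p = 1.0 - (1.0 - f / n) ** s
        estimate += 1.0 / p
    return estimate
```

With a perfect predictor this estimator is unbiased; a predictor that is only correct up to a constant factor perturbs the weights $1/p$ by a bounded amount, which is the regime the paper's analysis addresses.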