A data sketch algorithm scans a big data set, collecting a small amount of data -- the sketch, which can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently, and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.
翻译:数据草图算法扫描了大数据集,收集了少量数据 -- -- 即草图,可用于从统计角度推断大数据集的属性。一些数据草图算法对大数据集进行固定规模随机抽样,并用该样本推断出符合大数据集中各种标准的项目的频率。本文显示如何从统计角度以高效和精确的方式推断出这些频率的近似正确(PAC)界限,从而精确地表明频率界限要么尖锐要么只有一个,这是不精确计算的最佳结果。