A flexible method is developed to construct a confidence interval for the frequency of a queried object in a very large data set, based on a much smaller sketch of the data. The approach requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals for random queries using a conformal inference approach. After achieving marginal coverage for random queries under the assumption of data exchangeability, the proposed method is extended to provide stronger inferences accounting for possibly heterogeneous frequencies of different random queries, redundant queries, and distribution shifts. While the presented methods are broadly applicable, this paper focuses on use cases involving the count-min sketch algorithm and a non-linear variation thereof, to facilitate comparison to prior work. In particular, the developed methods are compared empirically to frequentist and Bayesian alternatives, through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
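The idea of a conformal confidence interval for a sketched frequency can be illustrated with a minimal Python sketch. This is a simplified illustration under stated assumptions, not the paper's exact procedure: the `CountMinSketch` class, the `conformal_interval` helper, the synthetic Zipf-like stream, and all parameter values (`width`, `depth`, `m`, `alpha`) are hypothetical choices made for the example. It assumes the first `m` observations are stored as calibration objects, their exact counts are tracked while the remaining data are sketched, and the query is a fresh random draw exchangeable with the calibration objects. It covers only the basic marginal interval, not the extensions to heterogeneous frequencies, redundant queries, or distribution shift.

```python
import hashlib
import math
import random
from collections import Counter


class CountMinSketch:
    """Toy count-min sketch (illustrative, not an optimized implementation)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Deterministic hash per row, derived from SHA-256 of "row:item".
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def query(self, item):
        # Collisions only inflate counters, so this never undercounts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def conformal_interval(sketch, tracked_counts, calibration_items, query, alpha=0.1):
    """Return (lower, upper) for the frequency of `query` in the sketched data.

    Scores are the sketch's over-counts on calibration items whose exact
    counts are known; the lower bound subtracts a conformal quantile of
    those scores from the sketch estimate.
    """
    scores = sorted(sketch.query(x) - tracked_counts[x] for x in calibration_items)
    m = len(scores)
    k = math.ceil((1 - alpha) * (m + 1))  # rank of the conformal quantile
    upper = sketch.query(query)           # deterministic upper bound
    if k > m:                             # too few calibration points
        return 0, upper
    return max(0, upper - scores[k - 1]), upper


# Synthetic stream with Zipf-like frequencies (assumed data, for illustration).
rng = random.Random(0)
population = list(range(1, 2001))
weights = [1 / i for i in population]
stream = rng.choices(population, weights=weights, k=20000)

m = 500                                   # number of calibration observations
calibration_items = stream[:m]
tracked = dict.fromkeys(calibration_items, 0)

sketch = CountMinSketch(width=200, depth=3)
for x in stream[m:]:
    sketch.add(x)
    if x in tracked:                      # exact counts only for stored objects
        tracked[x] += 1

query = rng.choices(population, weights=weights, k=1)[0]
lower, upper = conformal_interval(sketch, tracked, calibration_items, query)
true_count = Counter(stream[m:])[query]
print(f"query={query} interval=[{lower}, {upper}] true={true_count}")
```

The interval never needs to know how the sketch hashes objects or how the data are distributed: it only compares sketch estimates with exact counts on the calibration objects. The coverage guarantee is marginal over random queries, so an individual interval may miss the true count.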