A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals under the sole assumption of data exchangeability. Although our solution is broadly applicable, this paper focuses on applications involving the count-min sketch algorithm and a non-linear variation thereof. The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
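To make the setting concrete, the following is a minimal Python sketch of a count-min sketch paired with a simple conformal-style lower bound. It is an illustration under stated assumptions, not the procedure developed in the paper: the class `CountMinSketch`, the function `conformal_interval`, the hash construction, and the constant-offset conformity score (sketch estimate minus true count) are all hypothetical choices made here for brevity. The upper endpoint is the usual deterministic count-min bound; the lower endpoint is expected to hold with probability at least 1 - alpha only if the calibration items and the query are exchangeable.

```python
import hashlib
import math
import random
from collections import Counter


class CountMinSketch:
    """Plain count-min sketch: `depth` hashed rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Hypothetical hash choice: one salted BLAKE2b digest per row.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def query(self, item):
        # The minimum over rows never underestimates the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def conformal_interval(sketch, calib_items, true_counts, query, alpha=0.1):
    """Return (lower, upper) for the frequency of `query` in the sketched data.

    Assumption: the conformity score is the over-count of the sketch,
    sketch.query(x) - true_counts[x], measured on calibration items whose
    true counts in the sketched data were tracked exactly.
    """
    scores = sorted(sketch.query(x) - true_counts[x] for x in calib_items)
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))  # rank of the conformal quantile
    upper = sketch.query(query)
    if k > n:
        # Too few calibration points for this alpha: fall back to the trivial bound.
        return 0, upper
    return max(0, upper - scores[k - 1]), upper


# Toy usage: hold out the first 200 items for calibration, sketch the rest.
rng = random.Random(0)
vocab = [f"w{i}" for i in range(500)]
weights = [1.0 / (i + 1) for i in range(500)]  # Zipf-like frequencies
stream = rng.choices(vocab, weights=weights, k=20000)

calib, rest = stream[:200], stream[200:]
sketch = CountMinSketch(width=200, depth=3)
for item in rest:
    sketch.add(item)
# For simplicity all counts are kept here; only the calibration items are needed.
true_counts = Counter(rest)

new_query = rng.choices(vocab, weights=weights, k=1)[0]
lower, upper = conformal_interval(sketch, calib, true_counts, new_query)
print(new_query, lower, true_counts[new_query], upper)
```

The sketch uses a single score offset for all queries, which tends to give wide intervals for rare items; a data-adaptive score would replace that step.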