Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle, often a human or an expensive machine learning model, our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can catastrophically fail in satisfying these recall and precision targets. We show that our algorithms can improve query result quality by up to 30x for both the precision and recall targets in both real and synthetic datasets.
翻译:由于数据获取和存储成本的下降,研究人员和产业分析员往往希望在大型数据集中找到所有罕见事件的事例。例如,科学家可以廉价地捕捉数千小时的视频,但由于需要人工检查长视频以识别相关对象和事件而受到限制。为了降低这一成本,最近的工作提议使用廉价的代用模型,如图像分类师等,以确定符合数据筛选过滤器的一套近似数据点。不幸的是,最近的工作没有为科学和生产环境提供必要的统计准确性保证。在这项工作中,我们引入了具有统计准确性保证的近似选择查询的新算法。也就是说,如果从一个神器(通常是一个人类或一个昂贵的机器学习模型)得到数量有限的精确识别,那么我们的算法就达到一个最起码的精确度或极有可能的召回目标。相比之下,现有的方法可能灾难性地无法满足这些召回和精确的目标。我们表明,我们的算法可以提高查询质量,在实际和合成数据集的精确度和召回目标上达到30x。