Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). Because these DNNs are expensive and many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques are not applicable because they assume the results of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) does not support aggregation queries with predicates. To accelerate aggregation queries with expensive predicates, we develop and analyze ABae, a query processing algorithm that leverages proxies. ABae must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, ABae uses pilot sampling and plug-in estimates to sample according to the optimal allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms baselines on six real-world datasets, reducing labeling costs by up to 2.3x.
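The pipeline described in the abstract (stratify by proxy, pilot sample, compute plug-in estimates, allocate the remaining budget, combine) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `abae_sketch`, its parameters, and the `proxy` and `oracle` callables are hypothetical. The sketch also makes several assumptions that the abstract does not state: strata are equal-size quantile buckets of the proxy score, the second-stage allocation is proportional to sqrt(p_k) * sigma_k (where p_k is the estimated predicate rate and sigma_k the estimated standard deviation among matching records in stratum k), and pilot labels are reused in the final estimate.

```python
import math
import random
from statistics import mean, pstdev


def abae_sketch(records, proxy, oracle, num_strata=4,
                pilot_per_stratum=50, budget=1000, seed=0):
    """Estimate the mean of a statistic over records satisfying a predicate.

    proxy(record)  -> cheap score, ideally higher for matching records.
    oracle(record) -> (matches, value); stands in for the expensive DNN.
    All names here are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(records)

    # 1. Stratify: sort by proxy score, cut into roughly equal-size buckets.
    ordered = sorted(records, key=proxy)
    bounds = [i * n // num_strata for i in range(num_strata + 1)]
    strata = [ordered[bounds[i]:bounds[i + 1]] for i in range(num_strata)]
    for stratum in strata:
        rng.shuffle(stratum)  # a prefix of a shuffled list is a uniform sample

    labels = [[] for _ in strata]  # oracle outputs collected per stratum

    def draw(k, count):
        # Label the next `count` unlabeled records of stratum k.
        start = len(labels[k])
        for record in strata[k][start:start + count]:
            labels[k].append(oracle(record))

    def plug_in(k):
        # Plug-in estimates: predicate rate, mean and std among matches.
        values = [v for ok, v in labels[k] if ok]
        p = len(values) / len(labels[k]) if labels[k] else 0.0
        mu = mean(values) if values else 0.0
        sigma = pstdev(values) if len(values) > 1 else 0.0
        return p, mu, sigma

    # 2. Pilot stage: a fixed number of oracle calls in every stratum.
    for k in range(num_strata):
        draw(k, pilot_per_stratum)

    # 3. Allocate the remaining budget (assumed rule: sqrt(p_k) * sigma_k).
    remaining = max(0, budget - num_strata * pilot_per_stratum)
    scores = []
    for k in range(num_strata):
        p, _, sigma = plug_in(k)
        scores.append(math.sqrt(p) * sigma)
    total = sum(scores)
    for k in range(num_strata):
        share = scores[k] / total if total > 0 else 1.0 / num_strata
        draw(k, int(remaining * share))

    # 4. Combine: weight each stratum mean by its size and predicate rate.
    numerator = denominator = 0.0
    for k in range(num_strata):
        p, mu, _ = plug_in(k)
        weight = len(strata[k]) / n
        numerator += weight * p * mu
        denominator += weight * p
    return numerator / denominator if denominator > 0 else float("nan")
```

A caller would pass any list of records together with a cheap scoring function and an expensive labeling function, for example `abae_sketch(frames, proxy=cheap_score, oracle=run_dnn, budget=2000)`, where `frames`, `cheap_score`, and `run_dnn` are placeholders. Strata in which the pilot finds no matching records receive no further budget under this allocation rule, which is the intended effect of concentrating oracle calls where the predicate is likely to hold.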