In exploratory data analysis, it is often desirable to know what answers are likely before computing them exactly. This can be achieved by designing systems that report estimates of an operation's result -- say op(data) -- early, based on the portion of the data processed so far. The estimates are refined continuously as more data is processed and eventually converge to the exact answer. Unfortunately, existing techniques of this kind, called Online Aggregation (OLA), are limited to a single operation; that is, they cannot produce estimates for op(op(data)) or op(...(op(data))). If such Deep OLA becomes possible, data analysts will be able to explore data more interactively using complex cascades of operations. In this work, we take a step toward Deep OLA with evolving data frames (edf), a novel data model that offers OLA for nested ops -- op(...(op(data))) -- by representing evolving structured data (with converging estimates) that is closed under set operations. That is, op(edf) produces yet another edf; thus, we can freely apply successive operations to an edf and obtain an OLA output for each op. We evaluate its viability with Wake, an edf-based OLA system, by comparing it against state-of-the-art OLA and non-OLA systems. In our experiments on the TPC-H dataset, Wake produces its first estimates 4.93x faster (median) than conventional systems, at the cost of a 1.3x median slowdown for exact answers. Besides being more general, Wake is also 1.92x faster (median) than existing OLA systems at producing estimates with under 1% relative error.
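The closure property, where op(edf) yields another edf, can be illustrated with a small sketch. This is not Wake's API: the names `evolving_sum` and `evolving_map` are hypothetical, an "evolving" value is modeled simply as a Python iterator of (fraction processed, estimate) pairs, and the scaled-sum estimator assumes rows arrive in random order.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical stand-in for an edf holding one scalar: a stream of
# (fraction_of_rows_processed, current_estimate) pairs.
Evolving = Iterator[Tuple[float, float]]


def evolving_sum(chunks: List[List[float]]) -> Evolving:
    """Yield a scaled-up running sum after each non-empty chunk.

    Assumes rows arrive in random order, so the partial sum scaled by
    total_rows / rows_seen estimates the full sum. The last value is exact.
    """
    total_rows = sum(len(chunk) for chunk in chunks)
    seen, running = 0, 0.0
    for chunk in chunks:
        if not chunk:
            continue  # nothing new to report for an empty chunk
        seen += len(chunk)
        running += sum(chunk)
        yield seen / total_rows, running * total_rows / seen


def evolving_map(fn: Callable[[float], float], source: Evolving) -> Evolving:
    """Apply fn to every estimate; the result is again an evolving value."""
    for fraction, estimate in source:
        yield fraction, fn(estimate)


if __name__ == "__main__":
    chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    n_rows = 6
    # Nested ops: mean = map(sum / n) over an evolving sum.
    for fraction, estimate in evolving_map(lambda s: s / n_rows, evolving_sum(chunks)):
        print(f"{fraction:.0%} processed, mean estimate = {estimate}")
```

Because `evolving_map` consumes and returns the same kind of stream, further operations can be stacked on its output, and each layer emits a refined estimate whenever its input does. A real system would also need to handle multi-row results, joins, and error bounds, none of which this sketch attempts.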