To Not Miss the Forest for the Trees - A Holistic Approach for Explaining Missing Answers over Nested Data (extended version)

Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanation has proven useful in a variety of contexts, including debugging complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach for producing query-based explanations. Our approach is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting and projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation of our approach on Spark, we demonstrate that it is the first to scale to large datasets and that it often finds explanations that existing techniques fail to identify.
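To make the idea of a query-based explanation concrete, the following is a minimal sketch in plain Python, not the heuristic algorithm the abstract describes. It runs a toy three-operator pipeline over nested records and, after each operator, re-checks whether any intermediate row could still contribute to the missing answer, reporting the first operator after which none can. All names (`PEOPLE`, `PIPELINE`, `blame_operator`, `mentions_city`) and the data are hypothetical, and the sketch does not reason about schema alternatives.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Operator = Tuple[str, Callable[[List[Row]], List[Row]]]

# Hypothetical nested input: each person carries a list of addresses.
PEOPLE: List[Row] = [
    {"name": "Ann", "addresses": [{"city": "NY", "year": 2010},
                                  {"city": "LA", "year": 2019}]},
    {"name": "Bob", "addresses": [{"city": "SF", "year": 2018}]},
]

def flatten_addresses(rows: List[Row]) -> List[Row]:
    # Unnest: one output row per (person, address) pair.
    return [{"name": r["name"], "city": a["city"], "year": a["year"]}
            for r in rows for a in r["addresses"]]

def filter_recent(rows: List[Row]) -> List[Row]:
    # Selection: keep only addresses from 2019 onwards.
    return [r for r in rows if r["year"] >= 2019]

def project_city(rows: List[Row]) -> List[Row]:
    # Projection: keep only the city attribute.
    return [{"city": r["city"]} for r in rows]

PIPELINE: List[Operator] = [
    ("flatten", flatten_addresses),
    ("filter year>=2019", filter_recent),
    ("project city", project_city),
]

def mentions_city(city: str) -> Callable[[Row], bool]:
    # Compatibility check that works on both nested and flattened rows.
    def check(row: Row) -> bool:
        if "addresses" in row:
            return any(a["city"] == city for a in row["addresses"])
        return row.get("city") == city
    return check

def blame_operator(rows: List[Row],
                   pipeline: List[Operator],
                   could_contribute: Callable[[Row], bool]) -> Optional[str]:
    """Return the first operator after which no row can still yield the
    missing answer, "input" if the data never contained it, or None if
    the answer is not missing at all."""
    if not any(could_contribute(r) for r in rows):
        return "input"
    current = rows
    for name, op in pipeline:
        current = op(current)
        # Re-validate after every step, as in technique (ii) above.
        if not any(could_contribute(r) for r in current):
            return name
    return None

if __name__ == "__main__":
    print(blame_operator(PEOPLE, PIPELINE, mentions_city("NY")))
```

Asking why "NY" is absent from the result points at the selection, because Ann's NY address dates from 2010. A real system would also have to consider that restructuring operators (nesting, projection) can be the culprit, which this sketch does not attempt.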


