Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanation has proven useful in a variety of contexts, including the debugging of complex analytical queries, which are frequent in big data systems such as Apache Spark. We present a novel approach for producing query-based explanations. Our approach is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting and projection) as potential causes of missing answers. To compute explanations efficiently, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation of our approach on Spark, we demonstrate that it is the first to scale to large datasets and that it often finds explanations that existing techniques fail to identify.
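To make the notion of a query-based explanation concrete, the following is a minimal sketch in plain Python of the general idea only: trace a small pipeline over nested data and report the first operator after which no intermediate row can still contribute to the missing answer. It is not the algorithm proposed here (it does not reason about schema alternatives and does not use Spark), and all names in it (`people`, `is_compatible`, `first_pruning_operator`, the operator functions) are hypothetical and chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Operator = Tuple[str, Callable[[List[Row]], List[Row]]]


def flatten_addresses(rows: List[Row]) -> List[Row]:
    """Unnest the 'addresses' list: one output row per (person, address)."""
    return [
        {"name": r["name"], "city": a["city"], "year": a["year"]}
        for r in rows
        for a in r["addresses"]
    ]


def filter_recent(rows: List[Row]) -> List[Row]:
    """Keep only addresses from 2019 onwards."""
    return [r for r in rows if r["year"] >= 2019]


def project_name_city(rows: List[Row]) -> List[Row]:
    """Drop every attribute except name and city."""
    return [{"name": r["name"], "city": r["city"]} for r in rows]


def is_compatible(row: Row, pattern: Row) -> bool:
    """True if the row could still contribute to an answer matching pattern.

    Simplifying assumption: an attribute is either top-level or nested
    inside the 'addresses' list; attributes found in neither place are
    treated as unconstrained.
    """
    for attr, wanted in pattern.items():
        if attr in row:
            if row[attr] != wanted:
                return False
        elif "addresses" in row:
            # The attribute is still nested: some element must match.
            if not any(a.get(attr) == wanted for a in row["addresses"]):
                return False
    return True


def first_pruning_operator(
    data: List[Row], pipeline: List[Operator], pattern: Row
) -> Optional[str]:
    """Return the name of the first operator that removes all compatible rows.

    Compatibility is re-checked on the output of every operator rather
    than inherited from the input rows.
    """
    rows = data
    for name, op in pipeline:
        had_compatible = any(is_compatible(r, pattern) for r in rows)
        rows = op(rows)
        has_compatible = any(is_compatible(r, pattern) for r in rows)
        if had_compatible and not has_compatible:
            return name
    return None  # no operator pruned the answer


# Hypothetical nested input: each person has a list of addresses.
people: List[Row] = [
    {"name": "Sue", "addresses": [{"city": "NY", "year": 2010},
                                  {"city": "LA", "year": 2020}]},
    {"name": "Bob", "addresses": [{"city": "NY", "year": 2021}]},
]

pipeline: List[Operator] = [
    ("flatten_addresses", flatten_addresses),
    ("filter_recent", filter_recent),
    ("project_name_city", project_name_city),
]

# Why is (Sue, NY) missing from the result?
print(first_pruning_operator(people, pipeline, {"name": "Sue", "city": "NY"}))
# prints: filter_recent
```

Re-checking compatibility after each operator matters in this example. Sue's input row is compatible with the missing answer, and one of its descendants, (Sue, LA, 2020), survives the filter. A tracer that only followed descendants of compatible input rows would therefore not blame the filter, whereas re-validation shows that the surviving descendant can no longer produce (Sue, NY).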