When analyzing large datasets, analysts often want explanations for unexpected results produced by their queries. In this work, we focus on aggregate SQL queries that expose correlations in the data. A major obstacle to interpreting such queries is confounding bias, which can produce an unexpected correlation. We generate explanations in the form of a set of confounding variables that account for the unexpected correlation observed in a query. We propose mining candidate confounding variables from external sources because, in many real-life scenarios, the explanations are not fully contained in the input data. We present an efficient algorithm that finds the optimal subset of attributes (mined from external sources and the input dataset) that explains the unexpected correlation. This algorithm is embodied in a system called MESA. We demonstrate experimentally, over multiple real-life datasets and through a user study, that our approach generates insightful explanations and outperforms existing methods that search for explanations only in the input data. We further demonstrate the robustness of our system to missing data and the ability of MESA to handle input datasets containing millions of tuples and an extensive search space of candidate confounding attributes.
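To make the notion of confounding bias in an aggregate query concrete, the following minimal sketch emulates a `GROUP BY` average over a toy table and then repeats it within each level of a third attribute. It is not MESA's algorithm and searches for nothing; the rows, the attribute roles (`group`, `confounder`, `outcome`), and the helper `avg_by` are hypothetical and chosen only so that the effect is visible.

```python
from collections import defaultdict
from statistics import mean

# Toy rows: (group, confounder, outcome). All values are invented for illustration.
ROWS = (
    [("A", "high", 10)] * 3 + [("A", "low", 4)] * 1
    + [("B", "high", 10)] * 1 + [("B", "low", 4)] * 3
)


def avg_by(rows, key):
    """Emulate SELECT key, AVG(outcome) FROM rows GROUP BY key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row[2])
    return {k: mean(values) for k, values in buckets.items()}


# Aggregate query on the group alone: the groups appear to differ.
overall = avg_by(ROWS, key=lambda r: r[0])

# Same query within each level of the candidate confounder.
stratified = avg_by(ROWS, key=lambda r: (r[1], r[0]))

print("overall:", overall)
print("stratified:", stratified)
```

In this toy table the overall averages differ between the two groups (8.5 versus 5.5), yet within each confounder level the averages are identical. The apparent correlation between group and outcome comes entirely from how the confounder is distributed across the groups, which is the kind of explanation the abstract describes.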