《从数据中独立发现》的算法 (An Algorithm for the Discovery of Independence from Data)

For years, independence has been considered as an important concept in many disciplines. Nevertheless, we present the first research that investigates the discovery problem of independence in data. In its arguably simplest form, independence is a statement between two sets of columns expressing that for every two rows in a table there is also a row in the table that coincides with the first row on the first set of columns and with the second row on the second set of columns. We show that the problem of deciding whether there is an independence statement that holds on a given table is not only NP-complete but $W[3]$-complete in its arguably most natural parameter, namely its arity. We establish the first algorithm to discover all independence statement that hold on a given table. We illustrate in experiments with benchmark data that our algorithm performs well within the limits established by our hardness results. In practice, it is often useful to determine the ratio with which independence statements hold on a given table. For that purpose, we show that our treatment of independence and the design of our algorithm enables us to extend our findings to approximate independence. In our final experiments, we provide some insight into the trade-off between run time and the approximation ratio. Naturally, the smaller the ratio, the more approximate independence statements hold, and the more time it takes to discover all of them. While this research establishes first insight into the computational properties of discovering independence from data, we hope to initiate research into more sophisticated notions of independence, including embedded multivalued dependencies, as well as their context-specific and probabilistic variants.

翻译：多年来,独立一直被认为是许多学科中的一个重要概念。然而,我们提出了调查数据独立性发现问题的第一份研究。以其最简单的形式,独立是两组列间的一项说明,表明表内每两行中还有一行与第一列第一行相吻合,第二行与第二组列第二行相吻合。我们表明,在确定某一表内是否保留独立声明的问题上存在问题,不仅是NP的完整,而且在其可能最自然的参数,即其公正性方面,我们提出了[3]美元。我们建立了第一个算法,以发现所有保留在某一表上的独立声明。我们用基准数据实验来说明,我们的算法在我们的硬性结果所确定的限度内,也有一个行。在实践中,确定独立声明在某一表内保持何种比例是有用的。为此,我们从对独立性的处理和算法的设计,使我们得以将我们的调查结果扩展到大概的独立性。在我们最后的实验中,我们提供了一些关于贸易-直径的准确性说明,包括时间和直径的精确性比率,当然地说,我们掌握着这种直径和直径的准确性。