Much real-world data comes with explicitly defined domain orders, e.g., lexicographic order for strings, numeric order for integers, and chronological order for time. Our goal is to discover implicit domain orders that are not already known; for instance, that the months in the Chinese Lunar calendar are ordered Corner < Apricot < Peach. To do so, we extend data profiling methods to discover implicit domain orders in data through order dependencies. We enumerate tractable special cases and proceed towards the most general case, which we prove is NP-complete. We show that the general case can nevertheless be handled effectively by a SAT solver. We also devise an interestingness measure to rank the discovered implicit domain orders, and we validate this measure with a user study. Through an extensive suite of experiments on real-world data, we establish the efficacy of our algorithms and the utility of the discovered domain orders, demonstrating significant added value in three applications (data profiling, query optimization, and data mining).
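To make the idea concrete, the following is a minimal sketch of the simplest setting only: one column with a known (explicit) order and one column whose order is to be inferred. It is not the algorithm of this work, and it does not cover the NP-complete general case or the SAT encoding. The function name `implicit_order` and the day offsets in the example are hypothetical, chosen purely for illustration. The sketch sorts rows by the explicitly ordered column, records a "must precede" constraint whenever the other column changes value, and returns a topological order of those constraints if one exists.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+


def implicit_order(rows):
    """Infer an order over the values of the second column.

    rows: iterable of (key, value) pairs, where `key` has a known order.
    Returns a list of values consistent with the key order, or None if
    the data admits no such order (hypothetical helper, illustration only).
    """
    ordered = sorted(rows, key=lambda r: r[0])
    # Map each value to the set of values that must come before it.
    preds = {value: set() for _, value in ordered}
    for (k1, prev), (k2, cur) in zip(ordered, ordered[1:]):
        if prev == cur:
            continue
        if k1 == k2:
            # Same key but different values: no order on values can
            # make the key column order the value column.
            return None
        preds[cur].add(prev)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        # Contradictory constraints, e.g. a < b and b < a.
        return None


# Hypothetical day offsets within one year, paired with month names.
rows = [(15, "Corner"), (40, "Corner"), (50, "Apricot"),
        (80, "Peach"), (62, "Apricot")]
print(implicit_order(rows))  # ['Corner', 'Apricot', 'Peach']
```

When the constraints form a single chain, as in this example, the result is a total order; otherwise the topological order is only one of several orders compatible with the data, which is one reason a ranking measure over candidate orders is useful.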