Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science, and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition, and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process remains an elusive goal. In this work, we present Alaska, the first benchmark based on a real-world dataset to seamlessly support multiple tasks (and their variants) of the data integration pipeline. The dataset consists of ~70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling metadata, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that were previously difficult to compare, and can foster the design of more holistic data integration solutions.