Data integration is a long-standing interest of the data management community and has many disparate applications, including business, government, and Web search. We have recently witnessed impressive results in isolated data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks for training and testing purposes. Unfortunately, such benchmarks often come with their own task definition, and it can be difficult to leverage them for more complex pipelines. As a result, evaluating automated pipelines for the entire data integration process remains an elusive goal. In this work, we present the Alaska benchmark, the first real-world dataset to seamlessly support multiple tasks and task variants of the data integration pipeline. It consists of a wide and heterogeneous selection of product specifications from different electronics e-commerce websites, covering hundreds of distinct product properties. Our benchmark comes with profiling metadata, pre-defined use cases with different characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on two crucial data integration tasks, Schema Matching and Entity Resolution, and some of their popular variants. Our benchmark allows us to compare on equal footing a variety of methods that were previously difficult to evaluate together, and we hope that it can foster the design of more holistic data integration solutions.