Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science, and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition, and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process remains an elusive goal. In this work, we present Alaska, the first benchmark based on a real-world dataset to seamlessly support multiple tasks (and their variants) of the data integration pipeline. The dataset consists of ~70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling metadata, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that were previously difficult to compare, and can foster the design of more holistic data integration solutions.