GitTables: A Large-Scale Corpus of Relational Tables

The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of currently 1.7M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 20M tables. We annotate table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. The corpus is available at https://gittables.github.io. Our analysis of GitTables shows that its structure, content, and topical coverage differ significantly from existing table corpora. We evaluate our annotation pipeline on hand-labeled tables from the T2Dv2 benchmark and find that our approach provides results on par with human annotations. We demonstrate a use case of GitTables by training a semantic type detection model on it and obtain high prediction accuracy. We also show that the same model trained on tables from the Web generalizes poorly.
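
To make the shape of a column annotation concrete, the sketch below models the four fields named in the abstract (semantic type, hierarchical relation, range type, description) and attaches them to column names by exact match after normalization. It is only an illustration under stated assumptions: the abstract does not describe the matching procedure, so the `SemanticType` class, the `normalize` and `annotate` helpers, and the two toy ontology entries are hypothetical and are not the GitTables pipeline.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SemanticType:
    """One ontology entry with the four annotation fields from the abstract."""
    label: str                 # semantic type name
    ontology: str              # source ontology, e.g. "Schema.org" or "DBpedia"
    parent: Optional[str]      # hierarchical relation (toy value, assumed)
    range_type: Optional[str]  # range type (toy value, assumed)
    description: str


def normalize(name: str) -> str:
    """Lowercase a name and split camelCase, snake_case and kebab-case into words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return re.sub(r"[\W_]+", " ", spaced).strip().lower()


def annotate(columns: Iterable[str],
             types: Iterable[SemanticType]) -> Dict[str, Optional[SemanticType]]:
    """Map each column name to the type whose normalized label matches, else None."""
    index = {normalize(t.label): t for t in types}
    return {column: index.get(normalize(column)) for column in columns}


# Toy entries for illustration only; field values are assumptions.
TYPES = [
    SemanticType("birthDate", "Schema.org", "Person", "Date", "Date of birth."),
    SemanticType("postalCode", "Schema.org", "PostalAddress", "Text", "The postal code."),
]

annotations = annotate(["birth_date", "PostalCode", "foo"], TYPES)
for column, semantic_type in annotations.items():
    print(column, "->", semantic_type.label if semantic_type else None)
```

Exact matching on normalized names is the simplest possible strategy and leaves columns such as `foo` unannotated; a matcher that tolerates synonyms or abbreviations would need a different similarity measure.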

