The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus that currently contains 1.7M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 20M tables. We annotate table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. The corpus is available at https://gittables.github.io. Our analysis of GitTables shows that its structure, content, and topical coverage differ significantly from existing table corpora. We evaluate our annotation pipeline on hand-labeled tables from the T2Dv2 benchmark and find that our approach provides results on par with human annotations. We demonstrate a use case of GitTables by training a semantic type detection model on it and obtaining high prediction accuracy. We also show that the same model trained on tables from the Web generalizes poorly.