The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations, and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.