Tables in scientific papers contain a wealth of valuable knowledge for the scientific enterprise. To help the many of us who frequently consult this type of knowledge, we present Tab2Know, a new end-to-end system to build a Knowledge Base (KB) from tables in scientific papers. Tab2Know addresses the challenge of automatically interpreting the tables in papers and of disambiguating the entities that they contain. To solve these problems, we propose a pipeline that employs both statistics-based classifiers and logic-based reasoning. First, our pipeline applies weakly supervised classifiers to recognize the types of tables and columns, with the help of a data labeling system and an ontology specifically designed for our purpose. Then, logic-based reasoning is used to link equivalent entities (via sameAs links) across different tables. An empirical evaluation of our approach on a corpus of papers in the Computer Science domain yielded satisfactory performance. This suggests that our approach is a promising step toward creating a large-scale KB of scientific knowledge.
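The abstract names two stages, weakly supervised column typing and rule-based sameAs linking, without giving their details. The following is a minimal sketch of how such a pipeline could be wired together, not Tab2Know's actual implementation. The labeling functions, the type names `Dataset` and `Metric`, the majority-vote combination, and the "same type and same normalized text" linking rule are all assumptions made for illustration.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions: each looks at a column header and its
# cells and votes for a column type, or abstains.
def lf_header_dataset(header, cells):
    return "Dataset" if "dataset" in header.lower() else ABSTAIN


def lf_header_metric(header, cells):
    known_metrics = {"accuracy", "precision", "recall", "f1"}
    return "Metric" if header.lower() in known_metrics else ABSTAIN


def lf_numeric_cells(header, cells):
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    return "Metric" if cells and all(is_number(c) for c in cells) else ABSTAIN


LABELING_FUNCTIONS = [lf_header_dataset, lf_header_metric, lf_numeric_cells]


def weak_label(header, cells):
    """Combine noisy votes by simple majority (an assumed combination rule)."""
    votes = [lf(header, cells) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


def normalize(text):
    """Lowercase and drop non-alphanumerics so 'CIFAR-10' matches 'cifar10'."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def link_rule(cells):
    """Assumed rule: same column type and same normalized text => sameAs.

    `cells` is a list of (cell_id, column_type, text) triples.
    """
    pairs = []
    for i, (id_a, type_a, text_a) in enumerate(cells):
        for id_b, type_b, text_b in cells[i + 1:]:
            if type_a == type_b and normalize(text_a) == normalize(text_b):
                pairs.append((id_a, id_b))
    return pairs


def same_as_clusters(entities, pairs):
    """Close sameAs pairs under symmetry and transitivity with union-find."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for e in entities:
        clusters.setdefault(find(e), set()).add(e)
    return list(clusters.values())
```

In this sketch, `weak_label("Dataset", ["MNIST", "CIFAR-10"])` yields a column type, and feeding typed cells from several tables through `link_rule` and then `same_as_clusters` groups the cells that refer to the same entity. A real system would train a classifier on the weak labels and express linking rules in a reasoning engine rather than in nested loops.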