In tasks such as question answering or text summarisation, background knowledge about the relevant entities is essential. Yet the information about entities in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete, particularly for long-tail and emerging entities. In this paper, we present an approach that exploits the semi-structured nature of listings (such as enumerations and tables) to identify the main entities of the listing items (i.e., of enumeration entries and table rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities at the token level; it outperforms an existing approach while being subject to fewer limitations. Thanks to a flexible input format, it is applicable to any kind of listing and, unlike prior work, does not depend on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus, extracting 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph.
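To illustrate what identifying subject entities at the token level can look like, the following minimal sketch turns per-token tags into mention spans, so that no entity boundaries are needed as input. The B/I/O tagging scheme, the function name `decode_mentions`, and the example listing item are assumptions made for illustration; the abstract does not specify the label set or the decoding procedure used by the actual system.

```python
from typing import List, Tuple


def decode_mentions(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, text) mention spans from per-token tags.

    Assumed scheme (hypothetical, not taken from the paper):
    "B" opens a mention, "I" continues it, anything else is outside.
    The end index is exclusive.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    mentions: List[Tuple[int, int, str]] = []
    start = None  # index of the first token of the currently open mention
    for i, tag in enumerate(tags):
        if tag == "I" and start is not None:
            continue  # the open mention simply grows by one token
        if start is not None:
            # Any tag other than a continuing "I" closes the open mention.
            mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
        if tag in ("B", "I"):
            # A stray "I" without a preceding "B" is leniently treated as a start.
            start = i
    if start is not None:
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Hypothetical listing item with tags as a token classifier might predict them.
tokens = ["*", "Miles", "Davis", "-", "Kind", "of", "Blue", "(", "1959", ")"]
tags = ["O", "B", "I", "O", "O", "O", "O", "O", "O", "O"]
mentions = decode_mentions(tokens, tags)
```

Because the spans are derived from the predicted tags alone, the same decoding step applies to enumeration entries and table rows alike once they are serialised into a token sequence.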