Large public knowledge graphs such as Wikidata contain billions of statements about tens of millions of entities, inspiring a variety of use cases that exploit this knowledge. In practice, however, much of the information relevant to users' needs is still missing from Wikidata, while current linked open data (LOD) tools are not suited to enriching large graphs like Wikidata. In this paper, we investigate the potential of enriching Wikidata with structured data sources from the LOD cloud. We present a novel workflow that includes gap detection, source selection, schema alignment, and semantic validation. We evaluate our enrichment workflow with two complementary LOD sources: DBpedia, a noisy source with broad coverage, and Getty, a manually curated source narrowly focused on the art domain. Our experiments show that our workflow can enrich Wikidata with millions of novel, high-quality statements from external LOD sources. Property alignment and data quality are key challenges, whereas entity alignment and source selection are well supported by existing Wikidata mechanisms. We make our code and data available to support future work.