Structured knowledge bases (KBs) are a foundation of many intelligent applications, yet they are notoriously incomplete. Language models (LMs) have recently been proposed for unsupervised knowledge base completion (KBC), but despite encouraging initial results, questions about their suitability remain open. Existing evaluations often fall short because they evaluate only popular subjects or sample facts that already exist in KBs. In this work, we introduce a novel, more challenging benchmark dataset and a methodology tailored to a realistic assessment of the KBC potential of LMs. For automated assessment, we curate a dataset called WD-KNOWN, which provides an unbiased random sample of Wikidata containing over 3.9 million facts. In a second step, we perform a human evaluation on predictions that are not yet in the KB, as only this provides real insight into the added value over existing KBs. Our key finding is that biases in the dataset conception of previous benchmarks lead to a systematic overestimation of LM performance for KBC. However, our results also reveal areas where LMs are strong. For example, we could significantly extend Wikidata for the relations nativeLanguage, by a factor of ~21 (from 260k to 5.8M facts) at 82% precision; usedLanguage, by a factor of ~2.1 (from 2.1M to 6.6M) at 82% precision; and citizenOf, by a factor of ~0.3 (from 4.2M to 5.3M) at 90% precision. Moreover, we find that LMs possess surprisingly strong generalization capabilities: even for relations where most facts were not directly observed during LM training, prediction quality can be high.
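The completion factors quoted above are consistent with measuring relative growth: newly added facts divided by the relation's existing fact count. A minimal sketch reproducing this arithmetic (the function name and structure are ours, not from the paper):

```python
def completion_factor(existing: int, completed: int) -> float:
    """Relative KB growth: facts newly added by the LM, divided by
    the number of facts the relation already had."""
    return (completed - existing) / existing

# Fact counts (before, after) per relation, as quoted in the abstract.
relations = {
    "nativeLanguage": (260_000, 5_800_000),    # ~21x growth at 82% precision
    "usedLanguage":   (2_100_000, 6_600_000),  # ~2.1x growth at 82% precision
    "citizenOf":      (4_200_000, 5_300_000),  # ~0.3x growth at 90% precision
}

for rel, (before, after) in relations.items():
    print(f"{rel}: factor ~{completion_factor(before, after):.1f}")
```

Running this prints factors of ~21.3, ~2.1, and ~0.3 respectively, matching the rounded values in the abstract.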