We address the challenge of building domain-specific knowledge models for industrial use cases, where labelled data and taxonomic information are initially scarce. Our focus is on inductive link prediction models as a basis for practical tools that support knowledge engineers in exploring text collections and in discovering and linking new (so-called open-world) entities to the knowledge graph. We argue that, although neural approaches to text mining have yielded impressive results in recent years, current benchmarks do not properly reflect the challenges typically encountered in the industrial wild. Our first contribution is therefore an open benchmark coined IRT2 (inductive reasoning with text) that (1) covers knowledge graphs of varying sizes (including very small ones), (2) comes with incidental, low-quality text mentions, and (3) includes not only triple completion but also ranking, which is relevant for supporting experts with discovery tasks. We investigate two neural models for inductive link prediction, one based on end-to-end learning and one that learns from the knowledge graph and the text data in separate steps. These models compete with a strong bag-of-words baseline. For linking, the results show a significant performance advantage for the neural approaches as soon as the available graph data decreases. For ranking, the results are promising, and the neural approaches outperform the sparse retriever by a wide margin.
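As a rough illustration of the kind of bag-of-words baseline (sparse retriever) referred to above, the following sketch ranks candidate mention contexts against a textual query with plain BM25. The task framing, function names, and toy data are assumptions made for illustration only; this is not the paper's actual baseline implementation.

```python
# Minimal sketch of a bag-of-words ranking baseline (standard BM25),
# assuming the task is: given a textual query describing an entity,
# rank candidate mention contexts. Illustration only.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score tokenised documents against a tokenised query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy usage: rank mention contexts for the query "jazz pianist".
docs = [
    "she toured as a jazz pianist in the 1950s".split(),
    "the bridge was completed in 1932".split(),
    "a pianist known for bebop recordings".split(),
]
query = "jazz pianist".split()
scores = bm25_scores(query, docs)
ranked = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)
print(ranked)  # [0, 2, 1]: contexts mentioning both query terms rank first
```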