Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data

Natural language processing over structured data has become a growing research field in both the relational database and Semantic Web communities, with significant effort devoted to question answering over knowledge graphs (KGQA). However, many of these approaches either specifically target open-domain question answering over DBpedia or require large training datasets to translate a natural language question into SPARQL for querying the knowledge graph. Hence, they often cannot be applied directly to complex scientific datasets for which no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach to translate user questions into a ranked list of candidate SPARQL queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best candidate SPARQL query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, show that Bio-SODA outperforms publicly available KGQA systems by at least 20% in F1-score, and by an even larger margin on more complex bioinformatics datasets.
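
To make the idea of centrality-aware ranking concrete, the following is a minimal sketch and not the Bio-SODA implementation. The abstract does not say which centrality measure is used or how it is combined with other signals, so the sketch assumes normalized degree centrality and a simple linear weighting. The toy graph, the `ex:` prefix, the candidate queries, the `match_score` field, and the `weight` parameter are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy schema graph: each pair is an edge between two classes.
EDGES = [
    ("Gene", "Protein"),
    ("Gene", "Organism"),
    ("Gene", "Disease"),
    ("Protein", "Organism"),
    ("Annotation", "Protein"),
]

# Hypothetical candidate queries with an assumed text-match score in [0, 1]
# and the graph nodes each query touches.
CANDIDATES = [
    {"sparql": "SELECT ?d WHERE { ?d a ex:Disease ; ex:hasAnnotation ?a }",
     "match_score": 0.8, "nodes": ["Disease", "Annotation"]},
    {"sparql": "SELECT ?g WHERE { ?g a ex:Gene ; ex:inOrganism ?o }",
     "match_score": 0.8, "nodes": ["Gene", "Organism"]},
]


def degree_centrality(edges):
    """Normalized degree centrality, treating the graph as undirected."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    node_count = len(neighbors)
    if node_count <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (node_count - 1) for node, adj in neighbors.items()}


def rank_candidates(candidates, centrality, weight=0.5):
    """Sort candidates by a linear mix of match score and mean node centrality.

    The linear combination is an assumption made for illustration only.
    """
    def score(candidate):
        nodes = candidate["nodes"]
        mean_centrality = (
            sum(centrality.get(node, 0.0) for node in nodes) / len(nodes)
            if nodes else 0.0
        )
        return (1 - weight) * candidate["match_score"] + weight * mean_centrality

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    centrality = degree_centrality(EDGES)
    for candidate in rank_candidates(CANDIDATES, centrality):
        print(candidate["sparql"])
```

In this toy setting both candidates match the question text equally well, so the ordering is decided by centrality alone: the query over the better-connected `Gene` and `Organism` nodes is ranked ahead of the one over the peripheral `Disease` and `Annotation` nodes.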
