To translate natural language questions into executable database queries, most approaches rely on a fully annotated training set. Annotating a large dataset with queries is difficult because it requires query-language expertise. We reduce this burden by using intermediate question representations grounded in databases. These representations are simpler to collect and were originally crowdsourced within the Break dataset (Wolfson et al., 2020). Our pipeline consists of two parts: a neural semantic parser that converts natural language questions into the intermediate representations, and a non-trainable transpiler that converts these representations into the SPARQL query language (a standard language for accessing knowledge graphs and the semantic web). We chose SPARQL because its queries are structurally closer to our intermediate representations than SQL queries are. We observe that the execution accuracy of queries constructed by our model on the challenging Spider dataset is comparable to that of state-of-the-art text-to-SQL methods trained with annotated SQL queries. Our code and data are publicly available (see https://github.com/yandex-research/sparqling-queries).
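To illustrate the second stage of the pipeline, the following is a minimal sketch of how a step-by-step intermediate representation might be mapped to SPARQL without any training. It is not the transpiler from the repository: the step classes (`Select`, `Project`, `Filter`), the `transpile` function, the `http://example.org/schema#` namespace, and the example schema (`singer`, `age`, `name`) are all hypothetical and chosen only for illustration. A real transpiler would also need to handle grounding to the database schema and many more operators.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Select:
    cls: str  # hypothetical class name in the graph, e.g. "singer"


@dataclass
class Project:
    prop: str  # hypothetical property name, e.g. "age"
    ref: int   # 1-based index of the earlier step being projected from


@dataclass
class Filter:
    ref: int        # step whose entities are kept
    value_ref: int  # step that provides the compared value
    op: str         # comparison operator, e.g. ">"
    literal: Union[int, float, str]


Step = Union[Select, Project, Filter]


def transpile(steps: List[Step]) -> str:
    """Map a list of decomposition steps to a SPARQL query string."""
    if not steps:
        raise ValueError("at least one step is required")
    var_of: Dict[int, str] = {}  # step index -> SPARQL variable it denotes
    patterns: List[str] = []
    for i, step in enumerate(steps, start=1):
        if isinstance(step, Select):
            var = f"?s{i}"
            patterns.append(f"{var} a :{step.cls} .")
        elif isinstance(step, Project):
            var = f"?s{i}"
            patterns.append(f"{var_of[step.ref]} :{step.prop} {var} .")
        elif isinstance(step, Filter):
            # A filter keeps the same entity variable and adds a constraint.
            var = var_of[step.ref]
            lit = f'"{step.literal}"' if isinstance(step.literal, str) else str(step.literal)
            patterns.append(f"FILTER({var_of[step.value_ref]} {step.op} {lit})")
        else:
            raise TypeError(f"unsupported step: {step!r}")
        var_of[i] = var
    body = "\n  ".join(patterns)
    # The result of the last step is what the query returns.
    return (
        "PREFIX : <http://example.org/schema#>\n"
        f"SELECT {var_of[len(steps)]} WHERE {{\n  {body}\n}}"
    )


# "What are the names of singers older than 30?" as four steps:
# 1. singers; 2. age of #1; 3. #1 where #2 > 30; 4. name of #3
steps = [
    Select("singer"),
    Project("age", ref=1),
    Filter(ref=1, value_ref=2, op=">", literal=30),
    Project("name", ref=3),
]
query = transpile(steps)
print(query)
```

Each step introduces at most one new variable and one graph pattern, which hints at why a step-wise representation can line up naturally with SPARQL's pattern-based structure.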