Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. To address this question, we propose a benchmark for investigating to what extent the question-interpretation abilities of LLMs are actually compositional. To this end, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. The datasets are created in a highly controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions whose atomic parts they "understand". We conduct experiments with models of different sizes using various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ to $0.26$ and further to $0.09$ as the test questions deviate increasingly from the samples the models were optimized on. Even when all necessary information is provided to the model in the input, the $F_1$ scores do not exceed $0.57$ on the dataset of lowest complexity. We thus conclude that LLMs struggle to interpret questions systematically and compositionally and to map them to SPARQL queries.
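To illustrate the kind of compositional question-to-SPARQL mapping at stake (a sketch of our own for illustration, not an example taken from the benchmark itself; the questions and the properties dbo:capital and dbo:populationTotal are chosen merely as familiar DBpedia vocabulary), an atomic question corresponds to a single triple pattern, while a structurally more complex question reuses the same building blocks:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Atomic: "What is the capital of Germany?"
    SELECT ?capital WHERE {
      dbr:Germany dbo:capital ?capital .
    }

    # Composed: "What is the population of the capital of Germany?"
    # (the atomic capital pattern chained with a population pattern)
    SELECT ?population WHERE {
      dbr:Germany dbo:capital ?capital .
      ?capital dbo:populationTotal ?population .
    }

A systematic interpreter that handles each atomic pattern in isolation should, in principle, also handle their composition; the benchmark probes exactly this expectation.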