Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. To address this question, we propose a benchmark for investigating to what extent the question-interpretation abilities of LLMs are actually compositional. To this end, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. The datasets are created in a highly controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions whose atomic parts they "understand". We conduct experiments with models of different sizes using various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ to $0.26$ and further to $0.09$ as the test questions deviate increasingly from the samples the models were optimized on. Even when all necessary information is provided to the model in the input, the $F_1$ scores do not exceed $0.57$ on the dataset of lowest complexity. We thus conclude that LLMs struggle to interpret questions systematically and compositionally and to map them to SPARQL queries.
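To illustrate the kind of compositional question-to-SPARQL mapping at stake (a sketch of our own for illustration, not an example taken from the benchmark itself; the questions and the properties dbo:capital and dbo:populationTotal are chosen merely as familiar DBpedia vocabulary), an atomic question corresponds to a single triple pattern, while a structurally more complex question reuses the same building blocks:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # Atomic: "What is the capital of Germany?"
    SELECT ?capital WHERE {
      dbr:Germany dbo:capital ?capital .
    }

    # Composed: "What is the population of the capital of Germany?"
    # (the atomic capital pattern chained with a population pattern)
    SELECT ?population WHERE {
      dbr:Germany dbo:capital ?capital .
      ?capital dbo:populationTotal ?population .
    }

A systematic interpreter that handles each atomic pattern in isolation should, in principle, also handle their composition; the benchmark probes exactly this expectation.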