We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing, which produces semantic outputs based on the analysis of input text through constrained decoding of a prompted or fine-tuned language model. Developers of pretrained language models currently benchmark on classification, span extraction, and free-text generation tasks. Semantic parsing is neglected in language model evaluation because of the complexity of handling task-specific architectures and representations. Recent work has shown that generation from a prompted or fine-tuned language model can perform well at semantic parsing when the output is constrained to be a valid semantic representation. BenchCLAMP includes context-free grammars for six semantic parsing datasets with varied output meaning representations, as well as a constrained decoding interface to generate only outputs covered by these grammars. We provide low, medium, and high resource splits for each dataset, allowing accurate comparison of various language models under different data regimes. Our benchmark supports both prompt-based learning and fine-tuning, and provides an easy-to-use toolkit for language model developers to evaluate on semantic parsing.
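To illustrate the core idea of constrained decoding, the following is a minimal sketch, not the BenchCLAMP API: at each generation step, the model's candidate tokens are filtered so that the partial output remains a valid prefix of some derivation of the grammar. For simplicity, this toy version enumerates the (finite) set of derivations up front; a real implementation would instead check prefixes incrementally with a CFG parser. All names here (`valid_prefix`, `constrained_greedy_decode`, the mock scoring function) are hypothetical.

```python
# Sketch of grammar-constrained greedy decoding (hypothetical; not the
# actual BenchCLAMP interface). A mock language model assigns scores to
# next tokens, and the grammar filter vetoes any token that would take
# the output off every valid derivation.

from typing import Callable, List


def valid_prefix(tokens: List[str], derivations: List[List[str]]) -> bool:
    """True if `tokens` is a prefix of at least one full derivation."""
    return any(d[: len(tokens)] == tokens for d in derivations)


def constrained_greedy_decode(
    score: Callable[[List[str], str], float],  # mock LM: score of next token
    vocab: List[str],
    derivations: List[List[str]],  # enumerated derivations of a toy grammar
    max_len: int = 10,
) -> List[str]:
    out: List[str] = []
    while len(out) < max_len:
        # Keep only tokens that leave the output a valid grammar prefix.
        allowed = [t for t in vocab if valid_prefix(out + [t], derivations)]
        if not allowed:
            break
        out.append(max(allowed, key=lambda t: score(out, t)))
        if out in derivations:  # a complete derivation was reached
            break
    return out
```

In this sketch, even if the model strongly prefers a free-text token, the grammar filter excludes it, so the output is guaranteed to be a well-formed semantic representation:

```python
derivs = [["(", "call", ")"], ["(", "ask", ")"]]
vocab = ["(", ")", "call", "ask", "free_text"]

def mock_score(prefix, tok):
    # The unconstrained model would pick "free_text"; the grammar forbids it.
    return {"free_text": 5.0, "ask": 2.0}.get(tok, 1.0)

result = constrained_greedy_decode(mock_score, vocab, derivs)
```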