Sequence generation models are increasingly being used to translate language into executable programs, i.e., to perform executable semantic parsing. The fact that semantic parsing aims to execute actions in the real world motivates developing safe systems, which in turn makes measuring calibration -- a central component of safety -- particularly important. We investigate the calibration of common generation models across four popular semantic parsing datasets, finding that it varies across models and datasets. We then analyze factors associated with calibration error and release new confidence-based challenge splits of two parsing datasets. To facilitate the inclusion of calibration in semantic parsing evaluations, we release a library for computing calibration metrics.
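As a concrete illustration of what a calibration metric measures in this setting, the sketch below computes a binned expected calibration error (ECE) over sequence-level confidences and correctness labels. This is a minimal, generic implementation written for illustration, not the released library's API; the function name, bin count, and toy data are assumptions.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned Expected Calibration Error (ECE).

    confidences: per-prediction confidence scores in [0, 1]
                 (e.g. sequence-level probabilities from the parser).
    correct:     1 if the predicted program was judged correct
                 (e.g. exact match or execution accuracy), else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width confidence bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        accuracy = correct[mask].mean()        # empirical accuracy in the bin
        confidence = confidences[mask].mean()  # mean confidence in the bin
        # Weight the gap by the fraction of predictions falling in this bin.
        ece += mask.mean() * abs(accuracy - confidence)
    return ece


# Toy usage with hypothetical data: a well-calibrated model's confidences
# track correctness, yielding a low ECE; here the two wrong parses receive
# high confidence, so the score is inflated.
print(expected_calibration_error([0.95, 0.90, 0.40, 0.85], [1, 1, 0, 0]))
```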