Sequence generation models are increasingly being used to translate language into executable programs, i.e. to perform executable semantic parsing. The fact that semantic parsing aims to execute actions in the real world motivates developing safe systems, which in turn makes measuring calibration -- a central component of safety -- particularly important. We investigate the calibration of common generation models across four popular semantic parsing datasets, finding that it varies across models and datasets. We then analyze factors associated with calibration error and release new confidence-based challenge splits of two parsing datasets. To facilitate the inclusion of calibration in semantic parsing evaluations, we release a library for computing calibration metrics.
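The calibration metric most commonly reported in this setting is Expected Calibration Error (ECE). As a minimal sketch (not the released library's API), ECE bins predictions by confidence and takes the bin-size-weighted average gap between accuracy and mean confidence in each bin:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width-bin ECE: sum over bins of
    (bin size / N) * |bin accuracy - bin mean confidence|.

    confidences: model confidence in [0, 1] for each prediction
    correct: 1/0 indicator of whether each prediction was right
    (function name and signature are illustrative, not the paper's library)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin is closed on the left so confidence 0.0 is not dropped
        if i == 0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()
        avg_conf = confidences[mask].mean()
        ece += (mask.sum() / n) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model (e.g. 80% accuracy among predictions made with 0.8 confidence) yields an ECE of 0; a model that is confidently wrong yields an ECE near 1.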