Task-oriented semantic parsing is increasingly being used in user-facing applications, making it especially important to measure the calibration of parsing models. We examine the calibration characteristics of six models across three model families on two common English semantic parsing datasets, finding that many models are reasonably well-calibrated and that there is a trade-off between calibration and performance. Based on confidence scores across three models, we propose and release new challenge splits of the two datasets we examine. We then illustrate the ways a calibrated model can be useful in balancing common trade-offs in task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show that using model confidence allows us to improve performance by 9.6% (absolute) with interactions on only 2.2% of tokens. Using sequence-level confidence scores, we then examine how we can optimize the trade-off between a parser's usability and safety. We show that confidence-based thresholding can reduce the number of incorrect low-confidence programs executed by 76%; however, this comes at a cost to usability. We propose the DidYouMean system, which balances usability and safety. We conclude by calling for calibration to be included in the evaluation of semantic parsing systems, and release a library for computing calibration metrics.
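As a rough illustration of the confidence-based thresholding idea summarized above (a minimal sketch, not the paper's implementation), the following Python snippet rejects candidate programs whose sequence-level confidence falls below a tunable threshold; the data structures, threshold value, and example programs here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ParsedProgram:
    program: str        # candidate program produced by the parser
    confidence: float   # sequence-level confidence score in [0, 1]

def filter_by_confidence(candidates: List[ParsedProgram],
                         threshold: float = 0.7) -> List[ParsedProgram]:
    """Keep only programs whose confidence meets the threshold.

    Rejected low-confidence programs would be routed to a fallback
    (e.g. a clarification prompt) instead of being executed.
    """
    return [c for c in candidates if c.confidence >= threshold]

# Usage: raising the threshold executes fewer incorrect low-confidence
# programs (safety) but also rejects more correct ones (usability).
examples = [ParsedProgram("(call Tomorrow)", 0.93),
            ParsedProgram("(call Yesterday)", 0.41)]
print(filter_by_confidence(examples, threshold=0.7))
```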