We illustrate how a calibrated model can help balance common trade-offs in task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show that well-calibrated confidence scores allow us to balance cost with annotator load, improving accuracy with a small number of interactions. We then examine how confidence scores can help optimize the trade-off between usability and safety. We show that confidence-based thresholding can substantially reduce the number of incorrect low-confidence programs executed; however, this comes at a cost to usability. We propose the DidYouMean system, which better balances usability and safety.
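The thresholding idea above can be sketched in a few lines: execute a predicted program only when its calibrated confidence clears a threshold, and defer (e.g., to a user confirmation) otherwise. This is a minimal illustrative sketch, not the paper's implementation; the names `Prediction`, `gate`, and the threshold value 0.7 are assumptions chosen for the example.

```python
# Minimal sketch of confidence-based thresholding for program execution.
# All names and the threshold value are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Prediction:
    program: str
    confidence: float  # calibrated probability that the parse is correct


def gate(pred: Prediction, threshold: float = 0.7) -> str:
    """Execute only high-confidence programs; defer the rest for review."""
    if pred.confidence >= threshold:
        return f"executed: {pred.program}"
    return f"deferred: {pred.program}"


print(gate(Prediction("turn_on(lights)", 0.92)))   # high confidence: executed
print(gate(Prediction("delete_all(files)", 0.31)))  # low confidence: deferred
```

Raising the threshold blocks more incorrect programs (safety) but also defers more correct ones (usability), which is exactly the trade-off the abstract describes.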