Text-to-text generation models have increasingly become the go-to solution for a wide variety of sequence labeling tasks (e.g., entity extraction and dialog slot filling). While most research has focused on labeling accuracy, a key aspect -- of vital practical importance -- has slipped through the cracks: understanding model confidence. More specifically, we lack a principled understanding of how to reliably gauge the confidence of a model in its predictions for each labeled span. This paper aims to provide some empirical insights on estimating model confidence for generative sequence labeling. Most notably, we find that simply using the decoder's output probabilities does not yield well-calibrated confidence estimates. As verified over six public datasets spanning different tasks, we show that our proposed approach -- which leverages statistics from the top-$k$ predictions of a beam search -- significantly reduces the calibration errors of a generative sequence labeling model's predictions.
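The abstract does not spell out how top-$k$ beam statistics are aggregated; one plausible minimal sketch is to score a labeled span by the probability mass of the beam hypotheses that agree on it, normalized over the top-$k$ beams. The function below is a hypothetical illustration of that idea (the representation of hypotheses as sets of span-label pairs is an assumption, not the paper's method):

```python
import math

def span_confidence(beams, target_span):
    """Estimate confidence for one labeled span from top-k beam hypotheses.

    beams: list of (hypothesis, log_prob) pairs from beam search, where each
           hypothesis is a set of (span_text, label) predictions.
           (This representation is assumed for illustration.)
    target_span: the (span_text, label) pair whose confidence we want.

    Returns the probability mass of beams that contain the span, normalized
    by the total mass of the top-k beams.
    """
    total = sum(math.exp(lp) for _, lp in beams)
    agree = sum(math.exp(lp) for hyp, lp in beams if target_span in hyp)
    return agree / total if total > 0 else 0.0

# Toy usage: three beams, two of which agree that "Paris" is a LOC.
beams = [
    ({("Paris", "LOC")}, math.log(0.6)),
    ({("Paris", "LOC"), ("Friday", "DATE")}, math.log(0.25)),
    ({("Paris", "ORG")}, math.log(0.15)),
]
conf = span_confidence(beams, ("Paris", "LOC"))  # ~0.85
```

Intuitively, a span on which many high-probability beams agree receives a confidence near 1, while a span appearing only in low-probability hypotheses is down-weighted, which is the kind of signal a single decoder output probability cannot provide.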