Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently by the tokenizer, and thus leads to performance drops as well as hallucination. We propose a simple yet effective fix to this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F2 when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Furthermore, the model converges faster and is less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be done when solving extractive tasks, and recommend applying consistent tokenization during training.
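To make the notion of tokenization (in)consistency concrete, below is a minimal sketch using a Hugging Face BART tokenizer; the example strings and the span-matching logic are illustrative assumptions, not the exact implementation used in this work. With a BPE tokenizer, an answer string tokenized in isolation can receive different token ids than the same span tokenized as part of the context (for instance because of the leading-space marker); the consistent alternative is to copy the target ids directly from the tokenized input.

```python
from transformers import AutoTokenizer

# Illustrative sketch of consistent vs. inconsistent target tokenization
# for extractive QA with a generative model (assumed setup, not the
# authors' exact code).
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

context = "The Eiffel Tower is located in Paris, the capital of France."
answer = "Paris"
start_char = context.index(answer)      # character offset of the answer span
end_char = start_char + len(answer)

# Inconsistent: tokenize the answer on its own. A BPE tokenizer treats a word
# differently with/without a preceding space, so these ids may not match the
# ids the same span receives inside the context.
inconsistent_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

# Consistent: tokenize the context once, then keep the token ids whose
# character offsets overlap the answer span, so the target is a verbatim
# sub-sequence of the tokenized input.
enc = tokenizer(context, return_offsets_mapping=True, add_special_tokens=False)
consistent_ids = [
    tok_id
    for tok_id, (s, e) in zip(enc["input_ids"], enc["offset_mapping"])
    if s < end_char and e > start_char  # token overlaps the answer span
]

print(tokenizer.convert_ids_to_tokens(inconsistent_ids))  # e.g. ['Paris']
print(tokenizer.convert_ids_to_tokens(consistent_ids))    # e.g. ['ĠParis']
```

In this sketch, training on `consistent_ids` rather than `inconsistent_ids` keeps the output a literal sub-sequence of the tokenized input, which is the property the abstract refers to as consistent tokenization.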