Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Generative models have been widely applied to extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency, which is commonly neglected when training these models. When the tokenizer tokenizes the input and the output inconsistently, the extractive nature of these tasks is damaged, which leads to a performance drop as well as hallucination. We propose a simple yet effective fix for this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average of +1.7 F2 gain when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be done when solving extractive tasks, and we recommend applying consistent tokenization during training.
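The abstract does not spell out the fix, so the following is only an illustrative sketch of what tokenization inconsistency can look like and one way consistency might be restored. It assumes a hypothetical toy tokenizer (`toy_tokenize`, not BART's actual tokenizer) that marks words preceded by a space, in the spirit of byte-level BPE vocabularies; the helper names `inconsistent_target` and `consistent_target` are likewise made up for illustration.

```python
import re

# Marker for "this token was preceded by a space" (the character 'Ġ').
# Assumption: it mimics the convention of byte-level BPE vocabularies.
SPACE_MARK = "\u0120"


def toy_tokenize(text):
    """Hypothetical whitespace-sensitive tokenizer.

    Returns (token, start, end) triples with character offsets.
    A word or punctuation mark directly preceded by a space gets SPACE_MARK.
    """
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start, end = match.span()
        preceded_by_space = start > 0 and text[start - 1] == " "
        prefix = SPACE_MARK if preceded_by_space else ""
        tokens.append((prefix + match.group(), start, end))
    return tokens


def inconsistent_target(answer):
    """Tokenize the answer on its own, ignoring where it sits in the context."""
    return [tok for tok, _, _ in toy_tokenize(answer)]


def consistent_target(context, answer_start, answer_end):
    """Take the answer tokens from the tokenized context by character offset."""
    return [
        tok
        for tok, start, end in toy_tokenize(context)
        if start >= answer_start and end <= answer_end
    ]


context = "The capital of France is Paris."
answer = "Paris"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

context_tokens = [tok for tok, _, _ in toy_tokenize(context)]
standalone = inconsistent_target(answer)
in_context = consistent_target(context, answer_start, answer_end)

# The standalone target token never appears in the tokenized input,
# so the training target is no longer a true extraction from the input.
print("context tokens:", context_tokens)
print("standalone:", standalone, "found in context:", standalone[0] in context_tokens)
print("in context:", in_context, "found in context:", in_context[0] in context_tokens)
```

In this sketch the answer tokenized in isolation loses the space marker, so the target sequence is not a sub-sequence of the input tokens. Slicing the target out of the tokenized context by offset keeps the two aligned.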
