Scientific document understanding is challenging because the data is highly domain-specific and diverse. Moreover, datasets for tasks involving scientific text require expensive manual annotation and tend to be small, covering only one or a few fields. At the same time, scientific documents contain many potential training signals, such as citations, which can be used to build large labelled datasets. Given this, we present an in-depth study of cite-worthiness detection in English, where a sentence is labelled according to whether or not it cites an external source. To accomplish this, we introduce CiteWorth, a large, contextualized, rigorously cleaned labelled dataset for cite-worthiness detection, built from a massive corpus of extracted plain-text scientific documents. We show that CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation. Our best-performing cite-worthiness detection model is a paragraph-level contextualized sentence labelling model based on Longformer, exhibiting a 5 F1-point improvement over a SciBERT baseline that considers only individual sentences. Finally, we demonstrate that language model fine-tuning with cite-worthiness as a secondary task leads to improved performance on downstream scientific document understanding tasks.
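The core data-construction idea above, treating citation markers in scientific text as free supervision for cite-worthiness labels, can be sketched as follows. This is a minimal illustration, not CiteWorth's actual pipeline; the marker patterns and helper name are assumptions for the sake of example.

```python
import re
from typing import Tuple

# Illustrative citation-marker patterns (an assumption, not CiteWorth's
# actual cleaning rules): numeric brackets like [12] or [3, 4], and
# author-year parentheticals like (Smith et al., 2020).
CITATION_PATTERN = re.compile(
    r"\[\d+(?:\s*,\s*\d+)*\]"                      # [12], [3, 4]
    r"|\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)"   # (Smith et al., 2020)
)

def label_sentence(sentence: str) -> Tuple[str, bool]:
    """Strip citation markers from a sentence and return the cleaned text
    plus a silver cite-worthiness label: True if it cited a source."""
    cleaned, n_subs = CITATION_PATTERN.subn("", sentence)
    cleaned = " ".join(cleaned.split())              # collapse whitespace
    cleaned = re.sub(r"\s+([.,;:])", r"\1", cleaned)  # fix dangling punctuation
    return cleaned, n_subs > 0

sentences = [
    "Transformers dominate NLP benchmarks [12, 13].",
    "We describe our experimental setup below.",
    "This extends prior work (Smith et al., 2020).",
]
for s in sentences:
    text, cite_worthy = label_sentence(s)
    print(cite_worthy, text)
```

Removing the markers before labelling matters: a model trained on sentences that still contain "[12]" would simply learn to spot the marker rather than the underlying cue that a claim needs support.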