References in patents to scientific literature provide relevant information for studying the relation between science and technological inventions. These references allow us to answer questions about the types of scientific work that lead to inventions. Most prior work analysing the citations between patents and scientific publications focussed on front-page citations, which are well structured and provided in the metadata of patent archives such as Google Patents. In their 2019 paper, Verberne et al. evaluate two sequence labelling methods for extracting references from patents: Conditional Random Fields (CRF) and Flair. In this paper we extend that work by (1) improving the quality of the training data and (2) applying BERT-based models to the problem. We use error analysis throughout our work to find problems in the dataset, improve our models and reason about the types of errors different models are susceptible to. We first discuss the work by Verberne et al. and other related work in Section 2. We then describe the improvements we make to the dataset and the new models proposed for this task. We compare the results of our new models with previous results, both on the labelled dataset and on a larger unlabelled corpus. We end with a discussion of the characteristics of the results of our new models, followed by a conclusion. Our code and improved dataset are released under an open-source license on GitHub.