Discovering authoritative links between publications and the datasets they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves publications and reviews them for informal references to research datasets, complementing the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall when gathering literature to review for inclusion in data-related collections of publications, and it makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with the datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.