Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full text in particular have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets and improve on both of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code are publicly available at https://github.com/IllDepence/unarXive.
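As an informal illustration of how the released data might be consumed, the sketch below iterates over publication records and collects paragraphs that carry citation annotations. The file layout (JSON Lines), the field names (`paper_id`, `body_text`, `text`, `cite_spans`), and the directory name `unarXive_sample` are assumptions made for illustration only, not the documented unarXive schema; consult the repository linked above for the authoritative format.

```python
import json
from pathlib import Path

# A minimal sketch, assuming the data ships as JSON Lines files in which each
# record holds a paper ID, its body paragraphs, and citation span annotations.
# All field names used here (paper_id, body_text, text, cite_spans) are
# illustrative placeholders, not the official unarXive schema.

def iter_cited_paragraphs(data_dir: str):
    """Yield (paper_id, paragraph_text) pairs for paragraphs containing citations."""
    for path in Path(data_dir).glob("*.jsonl"):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                for para in record.get("body_text", []):
                    if para.get("cite_spans"):
                        yield record["paper_id"], para["text"]

if __name__ == "__main__":
    # Print a short preview of the first cited paragraph found.
    for paper_id, paragraph in iter_cited_paragraphs("unarXive_sample"):
        print(paper_id, paragraph[:80])
        break
```

Paragraph-level citation spans of this kind are what the citation recommendation training/test data mentioned above would typically be built from.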