Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full text in particular have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets and improve on both of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code are publicly available at https://github.com/IllDepence/unarXive.
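As an informal illustration of how the released data might be consumed, the sketch below iterates over publication records and collects paragraphs that carry citation annotations. The file layout (JSON Lines), the field names (`paper_id`, `body_text`, `text`, `cite_spans`), and the directory name `unarXive_sample` are assumptions made for illustration only, not the documented unarXive schema; consult the repository linked above for the authoritative format.

```python
import json
from pathlib import Path

# A minimal sketch, assuming the data ships as JSON Lines files in which each
# record holds a paper ID, its body paragraphs, and citation span annotations.
# All field names used here (paper_id, body_text, text, cite_spans) are
# illustrative placeholders, not the official unarXive schema.

def iter_cited_paragraphs(data_dir: str):
    """Yield (paper_id, paragraph_text) pairs for paragraphs containing citations."""
    for path in Path(data_dir).glob("*.jsonl"):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                for para in record.get("body_text", []):
                    if para.get("cite_spans"):
                        yield record["paper_id"], para["text"]

if __name__ == "__main__":
    # Print a short preview of the first cited paragraph found.
    for paper_id, paragraph in iter_cited_paragraphs("unarXive_sample"):
        print(paper_id, paragraph[:80])
        break
```

Paragraph-level citation spans of this kind are what the citation recommendation training/test data mentioned above would typically be built from.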