S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications

Mentorship is a critical component of academia, but it is less visible than publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, few large, representative mentorship datasets are available. We contribute two datasets to the study of mentorship. The first has 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually curated sources and linked to the Semantic Scholar (S2) knowledge graph. We use this dataset to train an accurate classifier for predicting mentorship relations from bibliographic features, achieving a held-out area under the ROC curve of 0.96. Our second dataset is formed by applying the classifier to the complete co-authorship graph of S2. The result is an inferred graph with 137 million weighted mentorship edges among 24 million nodes. We release this first-of-its-kind dataset to the community to help accelerate the study of scholarly mentorship: https://github.com/allenai/S2AMP-data
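
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores on held-out pairs. It is not the authors' evaluation code: the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration. The sketch uses the rank interpretation of the metric, namely the probability that a randomly chosen true mentor-mentee pair receives a higher score than a randomly chosen non-mentorship pair.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example scores
    higher. Ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out co-author pairs (not data from S2AMP):
# label 1 = known mentorship pair, label 0 = not a mentorship pair.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

A value of 1.0 would mean every true pair outranks every non-pair, and 0.5 corresponds to random scoring. The pairwise loop is quadratic in the number of examples, so at the scale described in the abstract a sort-based implementation from an established library would be the practical choice.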

