S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications

Mentorship is a critical component of academia, but is not as visible as publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, there are few large representative mentorship datasets available. We contribute two datasets to the study of mentorship. The first has over 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually-curated sources, and linked to the Semantic Scholar (S2) knowledge graph. We use this dataset to train an accurate classifier for predicting mentorship relations from bibliographic features, achieving a held-out area under the ROC curve of 0.96. Our second dataset is formed by applying the classifier to the complete co-authorship graph of S2. The result is an inferred graph with 137 million weighted mentorship edges among 24 million nodes. We release this first-of-its-kind dataset to the community to help accelerate the study of scholarly mentorship: https://github.com/allenai/S2AMP-data
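
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores on held-out pairs. It is not the authors' evaluation code: the function name `roc_auc` and the toy labels and scores are hypothetical, chosen only to illustrate the pairwise-ranking definition of the metric.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise-ranking definition.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out candidate pairs (assumption, not data from the paper):
# label 1 = true mentor-mentee pair, label 0 = co-authors without mentorship.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Under this definition, a score of 0.96 means that a true mentorship pair outranks a non-mentorship pair in roughly 96% of such comparisons. The quadratic loop is fine for a toy example; a sort-based implementation would be preferable at the scale the abstract describes.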


