Mentorship is a critical component of academia, but it is less visible than publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, few large, representative mentorship datasets are available. We contribute two datasets to the study of mentorship. The first contains 300,000 ground-truth academic mentor-mentee pairs obtained from multiple diverse, manually curated sources and linked to the Semantic Scholar (S2) knowledge graph. We use this dataset to train an accurate classifier that predicts mentorship relations from bibliographic features, achieving a held-out area under the ROC curve of 0.96. Our second dataset is formed by applying the classifier to the complete co-authorship graph of S2. The result is an inferred graph with 137 million weighted mentorship edges among 24 million nodes. We release this first-of-its-kind dataset to the community to help accelerate the study of scholarly mentorship: \url{https://github.com/allenai/S2AMP-data}