Predicting the Reproducibility of Social and Behavioral Science Papers Using Supervised Learning Models

Jian Wu, Rajal Nivargi, Sree Sai Teja Lanka, Arjun Manoj Menon, Sai Ajay Modukuri, Nishanth Nakshatri, Xin Wei, Zhuoer Wang, James Caverlee, Sarah M. Rajtmajer, C. Lee Giles

From arXiv: 17 pages, 8 figures; a draft to be submitted to JCDL'21

In recent years, significant effort has been invested in verifying the reproducibility and robustness of research claims in the social and behavioral sciences (SBS), much of it through resource-intensive replication projects. In this paper, we investigate predicting the reproducibility of SBS papers using machine learning methods based on a set of features. We propose a framework that extracts five types of features from scholarly work to support assessments of the reproducibility of published research claims. Bibliometric features, venue features, and author features are collected from public APIs or extracted using open source machine learning libraries with customized parsers. Statistical features, such as p-values, are extracted by recognizing patterns in the body text. Semantic features, such as funding information, are obtained from public APIs or extracted using natural language processing models. We analyze pairwise correlations between individual features, as well as the importance of each feature for predicting a set of human-assessed ground truth labels. In doing so, we identify a subset of 9 top features that play relatively more important roles in predicting the reproducibility of SBS papers in our corpus. We verify the results by comparing the performance of 10 supervised predictive classifiers trained on different sets of features.
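
As an illustration of the pattern-based extraction of statistical features mentioned above, the following is a minimal sketch, not the authors' implementation. The regular expression, the name `P_VALUE_PATTERN`, and the function `extract_p_values` are assumptions made for this example; a real extractor would likely need to handle more reporting styles (for instance "p-value of 0.03" or scientific notation).

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper):
# matches reports such as "p < .05", "p = 0.032", or "P<=0.001".
P_VALUE_PATTERN = re.compile(
    r"\bp\s*(<=|>=|<|>|=)\s*(0?\.\d+)(?!\d)",
    re.IGNORECASE,
)


def extract_p_values(text):
    """Return (comparator, value) pairs for each p-value-like pattern in text."""
    return [
        (match.group(1), float(match.group(2)))
        for match in P_VALUE_PATTERN.finditer(text)
    ]


sample = "The effect was significant (p < .05), but the interaction was not (p = 0.32)."
print(extract_p_values(sample))
```

The extracted pairs could then be summarized per paper (for example, the count of reported p-values or the smallest one) to form numeric features for a classifier; which summaries the framework actually uses is not stated in the abstract.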
