This article presents a study on the quality and execution of research code from publicly available replication datasets in the Harvard Dataverse repository. Research code is typically created by a team of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions that address aspects affecting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets containing over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. We identify common coding errors and resolve some of them with automatic code cleaning to aid execution. We find that 74\% of R files crashed in the initial execution, while 56\% crashed after code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets in journal collections and discuss the impact of journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
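As a rough illustration of the re-execution step described above (not the authors' actual pipeline), the following Python sketch runs each R file of a locally downloaded replication dataset in a fresh subprocess and records whether it crashes. The directory name, per-file time limit, and outcome labels are illustrative assumptions.

```python
"""Minimal sketch of re-executing R files and logging crashes.

Assumptions (not taken from the paper): a local directory named
"replication_dataset" holding one dataset's files, a 1-hour time
limit per script, and the outcome labels used below.
"""
import subprocess
from pathlib import Path

TIME_LIMIT_S = 3600  # assumed per-file time limit


def run_r_file(r_file: Path) -> dict:
    """Execute one R script with Rscript in a clean subprocess and report the outcome."""
    try:
        proc = subprocess.run(
            ["Rscript", str(r_file)],
            capture_output=True,
            text=True,
            timeout=TIME_LIMIT_S,
            cwd=r_file.parent,  # replication code often assumes the dataset directory as working dir
        )
        status = "success" if proc.returncode == 0 else "error"
        return {"file": r_file.name, "status": status, "stderr": proc.stderr[-1000:]}
    except subprocess.TimeoutExpired:
        return {"file": r_file.name, "status": "timeout", "stderr": ""}


if __name__ == "__main__":
    dataset_dir = Path("replication_dataset")  # hypothetical local copy of one Dataverse dataset
    results = [run_r_file(f) for f in sorted(dataset_dir.rglob("*.R"))]
    crashed = sum(r["status"] != "success" for r in results)
    print(f"{crashed}/{len(results)} R files failed to execute")
```

In a study-scale setting, a loop of this kind would be wrapped in a container or other clean runtime so that each dataset starts from the same environment, and the captured error messages would feed the error classification and automatic code cleaning discussed in the paper.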