This article presents a study on the quality and execution of research code from publicly available replication datasets in the Harvard Dataverse repository. Research code is typically created by a team of scientists and published together with academic papers to facilitate research transparency and reproducibility. For this study, we define ten questions that address aspects affecting research reproducibility and reuse. First, we retrieve and analyze more than 2000 replication datasets containing over 9000 unique R files published from 2010 to 2020. Second, we execute the code in a clean runtime environment to assess its ease of reuse. We identify common coding errors and resolve some of them with automatic code cleaning to aid execution. We find that 74\% of R files crashed in the initial execution, while 56\% crashed after code cleaning was applied, showing that many errors can be prevented with good coding practices. We also analyze the replication datasets in journal collections and discuss the impact of journal policy strictness on the code re-execution rate. Finally, based on our results, we propose a set of recommendations for code dissemination aimed at researchers, journals, and repositories.
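As a rough illustration of the re-execution step described above (not the authors' actual pipeline), the following Python sketch runs each R file of a locally downloaded replication dataset in a fresh subprocess and records whether it crashes. The directory name, per-file time limit, and outcome labels are illustrative assumptions.

```python
"""Minimal sketch of re-executing R files and logging crashes.

Assumptions (not taken from the paper): a local directory named
"replication_dataset" holding one dataset's files, a 1-hour time
limit per script, and the outcome labels used below.
"""
import subprocess
from pathlib import Path

TIME_LIMIT_S = 3600  # assumed per-file time limit


def run_r_file(r_file: Path) -> dict:
    """Execute one R script with Rscript in a clean subprocess and report the outcome."""
    try:
        proc = subprocess.run(
            ["Rscript", str(r_file)],
            capture_output=True,
            text=True,
            timeout=TIME_LIMIT_S,
            cwd=r_file.parent,  # replication code often assumes the dataset directory as working dir
        )
        status = "success" if proc.returncode == 0 else "error"
        return {"file": r_file.name, "status": status, "stderr": proc.stderr[-1000:]}
    except subprocess.TimeoutExpired:
        return {"file": r_file.name, "status": "timeout", "stderr": ""}


if __name__ == "__main__":
    dataset_dir = Path("replication_dataset")  # hypothetical local copy of one Dataverse dataset
    results = [run_r_file(f) for f in sorted(dataset_dir.rglob("*.R"))]
    crashed = sum(r["status"] != "success" for r in results)
    print(f"{crashed}/{len(results)} R files failed to execute")
```

In a study-scale setting, a loop of this kind would be wrapped in a container or other clean runtime so that each dataset starts from the same environment, and the captured error messages would feed the error classification and automatic code cleaning discussed in the paper.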