The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts (e.g., LOCKSS, CLOCKSS, Portico) are underway to preserve conventional scholarly content, typically in PDFs, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). The Software Heritage Foundation is working to archive public source code, but there is also value in archiving the issue threads, pull requests, and wikis that provide important context to the code, while maintaining their original URLs. Current implementations do not preserve source code and its ephemera, which presents a problem for scholarly projects where reproducibility matters. To understand and quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora from January 2007 to December 2021. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.66 million publications in the corpora. Collectively, these four platforms were linked to 160 times in 2007 and 76,746 times in 2021. In 2021, one out of five publications in the arXiv corpus included a URI to GitHub. GHPs like GitHub are too complex for conventional Web archiving techniques. The growing use of GHPs in scholarly publications therefore points to an urgent and increasing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.