Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term. A set of criteria is introduced to address this problem: Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free and open source software. As a proof of concept, we introduce "Maneage" (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications. We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders. This article is itself a Maneage'd project (project commit 54e4eb2).
翻译:为解决这一问题,引入了一套标准:完整性(除了最低限度的Unix类操作系统之外,没有执行要求,没有管理员特权,没有网络连接,主要储存在纯文本中);模块设计;最低复杂性;可缩放性;可核查的投入和产出;版本控制;将分析与叙述联系起来;自由开放源码软件。作为概念的证明,我们引入了“管理”(管理数据线),允许廉价的存档、出处提取和同行核查,并在一些研究出版物中进行了测试。我们表明,长寿是一项现实的要求,不会牺牲眼前或短期的再生能力。随后讨论了警告(以及拟议的解决办法),我们最后提出了对各利益攸关方的好处。这篇文章本身就是一个Maneage'd项目(项目承诺54e4eb2)。