In the era of data-driven science, conducting computational experiments that involve analysing large datasets on heterogeneous computational clusters is part of the everyday routine for many scientists. Moreover, to ensure the credibility of their results, it is very important that these analyses be easily reproducible by other researchers. Although various technologies that could facilitate the work of scientists in this direction have been introduced in recent years, there is still a lack of open-source platforms that combine them to this end. In this work, we describe and demonstrate SCHeMa, an open-source platform that facilitates the execution and reproducibility of computational analyses on heterogeneous clusters by leveraging containerization, experiment packaging, workflow management, and machine learning technologies.