In Earth Systems Science, many complex data pipelines combine different data sources and apply data filtering and analysis steps. Typically, such data analysis processes have grown historically and are implemented as many sequentially executed scripts. Scientific workflow management systems (SWMS) allow scientists to reuse their existing scripts and provide support for parallelization, reusability, monitoring, and failure handling. However, many scientists still rely on their sequentially executed scripts and do not benefit from the out-of-the-box advantages a SWMS can provide. In this work, we transform the data analysis processes of a machine learning-based approach that uses neural networks to calibrate the platform magnetometers of non-dedicated satellites into a workflow called Macaw (MAgnetometer CAlibration Workflow). We provide details on the workflow and the steps needed to port the original scripts to a scientific workflow. Our experimental evaluation compares the original sequential script executions on the original HPC cluster with our workflow implementation on a commodity cluster. Our results show that porting reduced the allocated CPU hours by 50.2% and the memory hours by 59.5%, leading to significantly less resource wastage. Further, by parallelizing single tasks, we reduced the runtime by 17.5%.