Scientific workflows consist of thousands of highly parallelized tasks executed in a distributed environment. Because the resulting volume of data cannot be analyzed manually, automatic tracing and analysis of the performance metrics, traces, and behavior of components and tasks are necessary to give the end user a suitable level of abstraction. The execution and monitoring of scientific workflows involve many components: the cluster infrastructure, its resource manager, the workflow, and the workflow tasks. The components in such an execution environment have access to different monitoring metrics and provide metrics at different levels of abstraction. The combination and analysis of metrics observed by different components, and of their interdependencies, remain largely unexplored. We specify four monitoring layers that can serve as an architectural blueprint for the monitoring responsibilities and the interactions of components in the scientific workflow execution context. We describe the monitoring metrics that belong to each of the four layers and how the layers interact. Finally, we examine five state-of-the-art scientific workflow management systems (SWMS) to assess which steps are needed to enable our four-layer approach.
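To make the idea of combining metrics across layers concrete, the following is a minimal sketch rather than the design proposed in the paper. It assumes, purely for illustration, that the four layers correspond to the four component types named above (infrastructure, resource manager, workflow, task); the `Layer`, `Metric`, and `group_by_task` names, the metric names, and the `task_id` join key are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    # Hypothetical layer names, borrowed from the component types in the abstract.
    INFRASTRUCTURE = "infrastructure"
    RESOURCE_MANAGER = "resource_manager"
    WORKFLOW = "workflow"
    TASK = "task"


@dataclass(frozen=True)
class Metric:
    layer: Layer
    entity: str    # reporting entity, e.g. a node name, job id, or run id
    name: str      # metric name, e.g. "cpu_util" (illustrative only)
    value: float
    task_id: str   # assumed join key linking the record to a workflow task


def group_by_task(metrics):
    """Collect metrics from all layers under the task they relate to."""
    grouped = defaultdict(lambda: defaultdict(list))
    for m in metrics:
        grouped[m.task_id][m.layer].append(m)
    return grouped


# Made-up sample records, one per layer, all referring to the same task.
metrics = [
    Metric(Layer.INFRASTRUCTURE, "node-1", "cpu_util", 0.93, "task-7"),
    Metric(Layer.RESOURCE_MANAGER, "job-42", "queue_wait_s", 12.0, "task-7"),
    Metric(Layer.WORKFLOW, "run-1", "retries", 1.0, "task-7"),
    Metric(Layer.TASK, "task-7", "runtime_s", 310.0, "task-7"),
]

view = group_by_task(metrics)
for layer, records in view["task-7"].items():
    print(layer.value, [(r.name, r.value) for r in records])
```

The point of the sketch is only that a shared identifier lets observations made by different components be viewed side by side for one task; how real systems expose such an identifier is outside what the abstract states.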