Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also had a profound impact on computational science and engineering domains such as geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses that combine scientific data and ML models to meet critical requirements such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute (i) a characterization of the lifecycle and a taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV-compliant data representation and a reference system architecture; and (iii) lessons learned from an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping overhead low (<1%) and scalability high, and that they accelerate queries by an order of magnitude under certain workloads compared with a setup without our representation.