With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large-scale ensemble data. Given complications such as multi-component workflows, heterogeneous machine architectures, parallel file systems, and batch scheduling, care must be taken to facilitate this analysis in a high-performance computing (HPC) environment. In this paper, we present Merlin, a workflow framework that enables large, ML-friendly ensembles of scientific HPC simulations. By augmenting traditional HPC with distributed compute technologies, Merlin aims to lower the barrier for scientific subject matter experts to incorporate ML into their analysis. In addition to its design, we describe example applications that Merlin has enabled on leadership-class HPC resources, such as the ML-augmented optimization of nuclear fusion experiments and the calibration of infectious disease models to study the progression of, and possible mitigation strategies for, COVID-19.