This work examines the challenges and opportunities of Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is used to gain insight into the behavior of current High Performance Computing (HPC) systems in order to improve their efficiency, performance, and reliability (e.g., through optimized cooling infrastructure, job scheduling, and application parameter tuning). We take the position that QCS in general, and MODA in particular, require close exchange with the ML community to realize the full potential of data-driven analysis for the benefit of existing and future HPC systems. This exchange will help identify the appropriate ML methods for gaining insight into current HPC systems and for going beyond expert-based knowledge and rules of thumb.