As HPC systems grow in complexity, efficient and manageable operation is increasingly critical. Many centers are thus starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from massive amounts of monitoring data and use it for control and visualization purposes. As ODA is a multi-faceted problem, much effort has gone into researching its separate aspects; however, accounts of production ODA experiences are still hard to come by. In this work we aim to bridge the gap between ODA research and production use by presenting our experiences with ODA in production, in particular the control of cooling infrastructures and the visualization of job data on two HPC systems. We cover the entire development process, from design to deployment, highlighting our insights in an effort to drive the community forward. We rely on open-source tools, which make for a generic ODA framework suitable for most scenarios.