This paper reports on the design and implementation of the HPC performance monitoring system deployed at the Max Planck Computing and Data Facility (MPCDF) to continuously monitor performance metrics of all jobs on its HPC systems. The system thereby provides important information to various stakeholders, in particular users, application support staff, system administrators, and management. On each compute node, hardware and software performance monitoring data is collected by our newly developed, lightweight, open-source hpcmd middleware, which builds upon standard Linux tools. The data is transported via rsyslog and then aggregated and processed by a Splunk system, enabling detailed per-cluster and per-job interactive analysis in a web browser. Additionally, performance reports are provided to users as PDF files. Finally, we report on practical experience and benefits from large-scale deployments on MPCDF HPC systems, demonstrating how our solution can be useful to any HPC center.