The ARCHER2 service, a CPU-based HPE Cray EX system with 750,080 cores (5,860 nodes), was deployed over the course of 2020 and 2021 and entered full service in December 2021. A key part of the work during this deployment was the integration of ARCHER2 into our local monitoring systems. Because ARCHER2 was one of the very first large-scale EX deployments, this involved close collaboration and development work with the HPE team during a global pandemic, when collaboration and co-working were significantly more challenging than usual. The deployment included the creation of automated checks and visual representations of system status, which needed to be made available to external parties for diagnosis and interpretation. We describe how these checks were deployed and how the data gathered played a key role in the deployment of ARCHER2, the commissioning of the plant infrastructure, the conduct of HPL runs for submission to the Top500, and the contractual monitoring of the availability of the ARCHER2 service during its commissioning and early life.