Improving datacenter operations is vital for the digital society. We posit that doing so requires our community to shift from analyzing operational aspects in isolation to holistic analysis of datacenter resources, energy, and workloads. In turn, this shift will require new analysis methods and open-access, FAIR datasets with fine temporal and spatial granularity. In this work, we leverage one of the rare public datasets that provide fine-grained information on datacenter operations. Using it, we show strong evidence that fine-grained information reveals new operational aspects. We then propose a method for holistic analysis of datacenter operations, providing statistical characterization of node, energy, and workload aspects. We demonstrate the benefits of our holistic analysis method by applying it to the operations of a datacenter infrastructure with over 300 nodes. Our analysis reveals both generic and ML-specific aspects, and further details how the operational behavior of the datacenter changed during the 2020 COVID-19 pandemic. We make over 30 main observations, providing holistic insight into the long-term operation of a large-scale, public scientific infrastructure. We suggest such observations can help immediately with performance engineering tasks, such as predicting future datacenter load, and in the long term with the design of datacenter infrastructure.