Although high-performance computing (HPC) systems have been scaled to meet the exponentially growing demand for scientific computing, HPC performance variability remains a major challenge and has become a critical research topic in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management, and it is nontrivial because one needs to predict a distribution function from system factors. In this paper, we propose a new framework to predict performance distributions. The proposed model is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We evaluate the performance of the proposed method on the IOzone variability data across various prediction tasks. Results show that the proposed method generates accurate predictions and outperforms existing methods. We also show how the predicted functional output can be used to generate predictions for scalar summaries of the performance distribution, such as the mean, standard deviation, and quantiles. Our methods can further be used as a surrogate model for HPC system variability monitoring and optimization.
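To make the last step concrete, the following is a minimal sketch of how scalar summaries could be read off a predicted cumulative distribution function (CDF) evaluated on a grid of throughput values. It is an illustration under stated assumptions, not the paper's implementation: the function names `enforce_monotone_cdf` and `summarize_cdf` are hypothetical, the running-maximum step is a simple post-hoc stand-in for the monotonic constraint (the paper builds the constraint into the modified Gaussian process itself), and the mean and standard deviation use a midpoint approximation of the probability mass in each grid interval.

```python
import numpy as np


def enforce_monotone_cdf(cdf_values):
    """Clip values to [0, 1] and take a running maximum.

    Assumption: a post-hoc fix-up, not the constrained model from the paper.
    """
    clipped = np.clip(np.asarray(cdf_values, dtype=float), 0.0, 1.0)
    return np.maximum.accumulate(clipped)


def summarize_cdf(grid, cdf_values, probs=(0.05, 0.5, 0.95)):
    """Return (mean, std, quantiles) for a CDF given on an increasing grid."""
    grid = np.asarray(grid, dtype=float)
    cdf = enforce_monotone_cdf(cdf_values)

    # Probability mass that falls in each grid interval.
    mass = np.diff(cdf)
    total = mass.sum()
    if total <= 0:
        raise ValueError("CDF is flat on the grid; no probability mass to summarize")
    mass = mass / total  # renormalize in case the grid truncates the tails

    # Midpoint approximation: place each interval's mass at its center.
    mid = 0.5 * (grid[:-1] + grid[1:])
    mean = float(np.sum(mid * mass))
    std = float(np.sqrt(np.sum((mid - mean) ** 2 * mass)))

    # Quantiles by linear interpolation of the inverse CDF.
    # Flat stretches of the CDF make the inverse ambiguous there.
    cdf_norm = (cdf - cdf[0]) / (cdf[-1] - cdf[0])
    quantiles = {p: float(np.interp(p, cdf_norm, grid)) for p in probs}
    return mean, std, quantiles
```

Called as `summarize_cdf(grid, predicted_cdf)`, where `predicted_cdf` holds the model's predicted CDF values at the throughput levels in `grid`, the function returns the three kinds of summaries named above. A finer grid reduces the error of the midpoint approximation.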