Graphics processing units (GPUs) are widely used in high-performance computing (HPC) applications such as imaging and video processing and the training of deep-learning models in artificial intelligence. GPUs installed in HPC systems are often heavily used, and GPU failures occur during HPC system operations. Thus, the reliability of GPUs is of interest for the overall reliability of HPC systems. The Cray XK7 Titan supercomputer was one of the top ten supercomputers in the world. The failure event times of more than 30,000 GPUs in Titan were recorded, and previous data analysis suggested that the failure time of a GPU may be affected by the GPU's connectivity location inside the supercomputer, among other factors. In this paper, we conduct in-depth statistical modeling of GPU failure times to study the effect of location on GPU failures under competing risks with covariates and spatially correlated random effects. In particular, two major failure types of GPUs in Titan are considered. The connectivity locations of cabinets are modeled as spatially correlated random effects, and the positions of GPUs inside each cabinet are treated as covariates. A Bayesian framework is used for statistical inference. We also compare different estimation methods, such as maximum likelihood, which is implemented via an expectation-maximization algorithm. Our results provide interesting insights into GPU failures in HPC systems.
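To make the data structure behind this setup concrete, the following is a minimal simulation sketch of competing-risks failure times with a within-cabinet covariate and a spatially correlated cabinet effect. It is an illustration only, not the model fitted in the paper: the exponential baseline hazards, the neighbour-averaging construction of the cabinet effects, the single effect shared by both failure types, the grid size, and all parameter values and function names (`smooth_cabinet_effects`, `simulate_gpu`, `simulate_fleet`) are assumptions made for the example.

```python
import math
import random


def smooth_cabinet_effects(rows, cols, sigma, rng):
    """Spatially correlated cabinet effects (assumed construction).

    Each cabinet's effect is the average of independent normal draws over the
    cabinet and its grid neighbours, so adjacent cabinets share draws and are
    therefore correlated. This is a stand-in for a proper spatial prior.
    """
    raw = [[rng.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]
    effects = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cells = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            vals = [raw[i][j] for i, j in cells if 0 <= i < rows and 0 <= j < cols]
            effects[r][c] = sum(vals) / len(vals)
    return effects


def simulate_gpu(position, cabinet_effect, base_rates, betas, censor_time, rng):
    """Return (observed_time, cause); cause 0 = censored, 1 or 2 = failure type."""
    # Assumed cause-specific hazard: base_rate_k * exp(beta_k * position + effect)
    rates = [b * math.exp(beta * position + cabinet_effect)
             for b, beta in zip(base_rates, betas)]
    # One latent exponential failure time per cause; only the earliest is observed
    times = [rng.expovariate(rate) for rate in rates]
    t = min(times)
    if t > censor_time:
        return censor_time, 0
    return t, times.index(t) + 1


def simulate_fleet(rows=4, cols=5, gpus_per_cabinet=10, seed=0):
    """Simulate one record per GPU on a rows x cols grid of cabinets."""
    rng = random.Random(seed)
    effects = smooth_cabinet_effects(rows, cols, sigma=0.5, rng=rng)
    records = []
    for r in range(rows):
        for c in range(cols):
            for g in range(gpus_per_cabinet):
                # Position inside the cabinet, scaled to [0, 1] (illustrative covariate)
                position = g / (gpus_per_cabinet - 1)
                t, cause = simulate_gpu(position, effects[r][c],
                                        base_rates=(0.05, 0.02),
                                        betas=(0.8, -0.3),
                                        censor_time=10.0, rng=rng)
                records.append({"cabinet": (r, c), "position": position,
                                "time": t, "cause": cause})
    return records
```

Calling `simulate_fleet()` yields one record per GPU with its cabinet coordinates, scaled position, observed time, and failure cause. Data of this shape (time, cause indicator, covariate, group label) is what a Bayesian or EM-based fitting routine for such a model would take as input.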