In recent years, the growing complexity of scientific simulations and the emerging demand for training large artificial intelligence models have required massive, fast data access, pushing high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges that HPC applications face under SSD-related failures remain unclear, in particular for failures that result in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults at the application layer to model errors originating from SSDs. FFIS can plant different I/O-related faults into the data returned from the underlying file system, which enables investigation of the error resilience characteristics of scientific file formats. We demonstrate the use of FFIS with three representative real-world HPC applications, showing how each application reacts to data corruption, and provide insights into the error resilience of the widely adopted HDF5 file format for HPC applications.
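To illustrate the idea of planting a fault into data returned from storage, the following is a minimal Python sketch, not FFIS itself. It assumes a single bit flip as the fault model and wraps an ordinary file-like object instead of a FUSE mount; the names `flip_random_bit` and `FaultyReader`, the `rate` parameter, and the seed are hypothetical and do not come from the paper.

```python
import io
import random


def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of `data` with exactly one randomly chosen bit inverted."""
    if not data:
        return data  # nothing to corrupt in an empty buffer
    corrupted = bytearray(data)
    bit = rng.randrange(len(corrupted) * 8)   # pick one bit in the buffer
    corrupted[bit // 8] ^= 1 << (bit % 8)     # invert that bit
    return bytes(corrupted)


class FaultyReader:
    """Hypothetical wrapper that corrupts each read() result with probability `rate`.

    Assumption: this stands in for the interception point that a FUSE-based
    tool would provide between the application and the file system.
    """

    def __init__(self, raw, rate: float, seed: int = 0):
        self._raw = raw
        self._rate = rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable

    def read(self, size: int = -1) -> bytes:
        data = self._raw.read(size)
        if self._rng.random() < self._rate:
            data = flip_random_bit(data, self._rng)
        return data


if __name__ == "__main__":
    original = bytes(range(16))
    reader = FaultyReader(io.BytesIO(original), rate=1.0)
    corrupted = reader.read()
    # Count how many bits differ between the clean and corrupted buffers.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(original, corrupted))
    print(differing)  # prints 1
```

Seeding the random generator is a design choice made here so that a given corruption can be replayed when studying how an application or file format reacts to it.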