To improve the reliability of SSD arrays, Redundant Array of Independent Disks (RAID) schemes are commonly employed. However, the conventional reliability models of HDD RAID cannot be applied to SSD arrays, as the nature of failures in SSDs is different from that in HDDs. Previous studies on the reliability of SSD arrays are based on outdated SSD failure data and focus on only a limited set of failure types, namely device failures and page failures caused by bit errors, while recent field studies have reported other failure types, including bad blocks and bad chips, as well as a high correlation between failures. In this paper, we explore the reliability of SSD arrays using field storage traces and a real-system implementation of conventional and emerging erasure codes. Reliability is evaluated by statistical fault injections that post-process the usage logs from the real-system implementation, while the fault/failure attributes are obtained from field data. As a case study, we examine conventional and emerging erasure codes in terms of both reliability and performance using Linux MD RAID and commercial SSDs. Our analysis shows that a) emerging erasure codes fail to replace RAID6 in terms of reliability, b) row-wise erasure codes are the most efficient choices for contemporary SSD devices, and c) previous models overestimate SSD array reliability by up to six orders of magnitude, as they focus on the coincidence of bad pages and bad chips, which accounts for only a minority of Data Loss (DL) in SSD arrays. Our experiments show that the combination of bad chips with bad blocks is the major source of DL in RAID5 and emerging codes (contributing more than 54% and 90% of DL in RAID5 and emerging codes, respectively), while RAID6 remains robust under these failure combinations. Finally, the fault injection results show that SSD array reliability, as well as the failure breakdown, is significantly correlated with SSD type.