In recent years, the availability and reliability of Data Storage Systems (DSS) have been significantly threatened by soft errors occurring in storage controllers. Because of their specific functionality and hardware-software stack, error propagation and manifestation in DSS differ considerably from those in general-purpose computing architectures. To our knowledge, no previous study has examined the system-level effects of soft errors on the availability and reliability of data storage systems. In this paper, we first analyze how soft errors occurring in the server processors of storage controllers affect the dependability of the entire storage system. To this end, we implemented the major functions of a typical data storage system controller, running on a full storage system operating system stack, and developed a framework to perform fault injection experiments using a full-system simulator. We then propose a new metric, the Storage System Vulnerability Factor (SSVF), to accurately capture the impact of soft errors in storage systems. Our extensive experiments reveal that, depending on the controller configuration, up to 40% of cache memory contains end-user data, and any unrecoverable soft error in this part results in irreversible Data Loss (DL). In contrast, soft errors in the rest of the cache memory, which is occupied by the Operating System (OS) and storage applications, result in Data Unavailability (DU) at the storage system level. Our analysis also shows that Detectable Unrecoverable Errors (DUEs) in the cache data field are the major cause of DU in storage systems, while Silent Data Corruptions (SDCs) in the cache tag and data fields are the main cause of DL.