Diagnosing storage system failures is challenging even for professionals. One example is the "When Solid State Drives Are Not That Solid" incident that occurred at an Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, such obscure failures will likely occur more often. As one step toward addressing this challenge, we present our ongoing effort called X-Ray. Unlike traditional methods that focus on either the software or the hardware, X-Ray leverages virtualization to collect events across layers and correlates them to generate a correlation tree. Moreover, by applying simple rules, X-Ray can highlight critical nodes automatically. Preliminary results based on 5 failure cases show that X-Ray can effectively narrow down the search space for failures.
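To make the idea of a correlation tree with rule-based highlighting more concrete, the following is a minimal sketch and not X-Ray's actual implementation. The `Event` record, the layer names, the `parent_id` link between layers, and the single rule used here (flag a failed event whose child events all succeeded) are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """Hypothetical cross-layer event record (not X-Ray's real format)."""
    event_id: int
    layer: str                       # assumed layer names, e.g. "syscall", "block", "device"
    op: str
    ok: bool = True                  # whether the event completed successfully
    parent_id: Optional[int] = None  # event in the layer above that triggered this one
    children: List["Event"] = field(default_factory=list)


def build_tree(events):
    """Link each event to the event that triggered it; return the root events."""
    by_id = {e.event_id: e for e in events}
    roots = []
    for e in events:
        parent = by_id.get(e.parent_id)
        if parent is None:
            roots.append(e)
        else:
            parent.children.append(e)
    return roots


def critical_nodes(roots):
    """Illustrative rule: a node is critical if it failed but all of its
    children succeeded, i.e. it is the deepest point where the failure shows."""
    found = []
    stack = list(roots)
    while stack:
        node = stack.pop()
        if not node.ok and all(c.ok for c in node.children):
            found.append(node.event_id)
        stack.extend(node.children)
    return sorted(found)


events = [
    Event(1, "syscall", "write", ok=False),
    Event(2, "block", "submit_bio", ok=False, parent_id=1),
    Event(3, "device", "write_cmd", ok=True, parent_id=2),
    Event(4, "block", "submit_bio", ok=True, parent_id=1),
]
roots = build_tree(events)
print(critical_nodes(roots))
```

In this toy trace the failed system call and the failed block request are both in the tree, but only the block-layer event is flagged, since the device command beneath it succeeded. That is the sense in which a simple rule can narrow the search space to one layer.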