Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters

Every day, data centers process huge volumes of data, backed by the proliferation of inexpensive hard disks. Data stored on these disks serve a range of critical functional needs, from finance and healthcare to aerospace. As such, premature disk failure and the consequent loss of data can be catastrophic. To mitigate the risk of failures, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of a hard disk drive, one can predict the time-to-failure of a particular device and replace it at the right time, ensuring maximum utilization whilst reducing operational costs. In this work, large-scale predictive analyses are performed on severely skewed health statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests that LSTMs are well suited to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model in which the context gained from health statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested across an exhaustive set covering all 10 years of S.M.A.R.T. health data in circulation from Backblaze and on a wide variety of disk instances. This work closes the knowledge gap on what full-scale training achieves on thousands of devices, and it advances the state of the art by providing tangible metrics for evaluation and generalization for practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM posted an RMSE of 0.83 on the exhaustive set while generalizing competitively to other hard drives in the Seagate family.
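As a rough illustration of the prediction target and the evaluation metric named above, the following minimal Python sketch derives days-remaining labels for one disk and scores a set of predictions with RMSE. The function names, the dates, and the predicted values are hypothetical and chosen only for illustration; the abstract does not state how the labels are constructed or whether the reported RMSE is computed on raw or normalized day counts.

```python
import math
from datetime import date


def rul_labels(observation_dates, failure_date):
    """Return the number of days remaining until failure for each observation.

    Assumption: one health record per day and a known failure date.
    """
    return [(failure_date - d).days for d in observation_dates]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical disk observed on five consecutive days, failing on the fifth.
days = [date(2020, 1, d) for d in range(1, 6)]
targets = rul_labels(days, date(2020, 1, 5))

# Hypothetical model outputs for the same five days (made-up values).
predictions = [4.5, 3.0, 2.5, 1.0, 0.0]
error = rmse(predictions, targets)
```

In a sequence-to-sequence setting, a list such as `targets` would play the role of the output sequence that the decoder is trained to produce from a window of health statistics.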
