Data centers process huge volumes of data every day, a workload made possible by the proliferation of inexpensive hard disks. Data stored on these disks serves critical functions in sectors ranging from finance and healthcare to aerospace. As such, premature disk failure and the consequent loss of data can be catastrophic. To mitigate the risk of failure, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of a hard disk drive, one can predict the time-to-failure of a particular device and replace it at the right time, ensuring maximum utilization while reducing operational costs. In this work, large-scale predictive analyses are performed on severely skewed health statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests that LSTMs are an excellent approach to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model in which the context gained from understanding sequences of health statistics aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested on all 10 years of S.M.A.R.T. health data in circulation from Backblaze, covering a wide variety of disk instances. This work closes the knowledge gap on what full-scale training achieves on thousands of devices, and it advances the state of the art by providing tangible metrics for evaluation and generalization for practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM posted an RMSE of 0.83 during training and 0.86 during testing over the exhaustive 10-year data while generalizing competitively to other drives from the Seagate family.
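As a rough illustration of the sequence-to-sequence framing described above, the following minimal Python sketch shows how daily health records for a failed drive might be turned into training pairs (an input window of readings and an output sequence of days remaining) and how RMSE scores a predicted sequence. The function names, the window lengths, and the convention that the failure day has zero days remaining are assumptions made for illustration; the abstract does not specify the preprocessing actually used.

```python
import math


def rul_labels(num_days):
    """Days remaining before failure for each daily record of a failed drive.

    Assumption: the last record is the failure day and is labeled 0.
    """
    return [num_days - 1 - i for i in range(num_days)]


def make_windows(features, labels, in_len, out_len):
    """Pair each input window of readings with the label sequence that follows it.

    `features` holds one entry per day (a scalar or a vector of S.M.A.R.T.
    attributes); `in_len` and `out_len` are hypothetical window lengths.
    """
    pairs = []
    for start in range(len(features) - in_len - out_len + 1):
        x = features[start:start + in_len]                      # encoder input
        y = labels[start + in_len:start + in_len + out_len]     # decoder target
        pairs.append((x, y))
    return pairs


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

In this sketch, an encoder-decoder model would consume each `x` and be trained to emit the matching `y`; `rmse` then compares the emitted sequence with the true countdown.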