Dynamic random access memory (DRAM) failures threaten the reliability of data centres because they lead to data loss and system crashes. Timely prediction of memory failures makes it possible to take preventive measures such as server migration and memory replacement. Memory failure prediction thus keeps failures from manifesting externally and is vital for improving system reliability. In this paper, we revisit the problem of memory failure prediction. We analyze correctable errors (CEs) from hardware logs as indicators of a degraded memory state. Because memory does not always operate at full occupancy, accesses to faulty memory regions are spread over time. Following this intuition, we observe that properties important for memory failure prediction are distributed over long time intervals. In contrast, related studies, in order to fit practical constraints, frequently analyze only the CEs from the most recent fixed-size time interval and ignore earlier information. Motivated by this discrepancy, we study the impact of including the overall (long-range) CE evolution and propose novel features that are calculated incrementally to preserve long-range properties. By coupling the extracted features with machine learning methods, we learn a predictive model that anticipates upcoming failures three hours in advance while improving the average relative precision and recall by 21% and 19%, respectively. We evaluate our methodology on real-world memory failures from the server fleet of a large cloud provider, demonstrating its validity and practicality.
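To make the contrast between fixed-window and long-range CE features concrete, the following minimal sketch maintains a few running statistics that are updated once per CE event, so the full history never needs to be stored. It is an illustration only: the class `IncrementalCEFeatures`, the chosen statistics (lifetime CE count, inter-arrival mean and standard deviation, recent-window count), and the one-hour window are all assumptions made for this example, not the features proposed in the paper.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncrementalCEFeatures:
    """Running CE statistics for one memory module (hypothetical feature set)."""

    window_seconds: float = 3600.0      # assumed size of the "recent" window
    total_count: int = 0                # long-range: CEs seen since the first log entry
    first_ts: Optional[float] = None    # timestamp of the first CE
    last_ts: Optional[float] = None     # timestamp of the latest CE
    gap_mean: float = 0.0               # running mean of inter-arrival gaps
    gap_m2: float = 0.0                 # running sum of squared gap deviations
    recent: deque = field(default_factory=deque)  # timestamps inside the window

    def update(self, ts: float) -> None:
        """Fold one CE timestamp (seconds, non-decreasing) into the statistics."""
        if self.last_ts is not None:
            gap = ts - self.last_ts
            n_gaps = self.total_count   # gaps observed so far, including this one
            # Welford's online update: no past gaps need to be kept.
            delta = gap - self.gap_mean
            self.gap_mean += delta / n_gaps
            self.gap_m2 += delta * (gap - self.gap_mean)
        else:
            self.first_ts = ts
        self.total_count += 1
        self.last_ts = ts

        # Fixed-window view: keep only timestamps newer than ts - window_seconds.
        self.recent.append(ts)
        while self.recent[0] <= ts - self.window_seconds:
            self.recent.popleft()

    def features(self) -> dict:
        """Return the current feature vector as a dictionary."""
        n_gaps = max(self.total_count - 1, 0)
        gap_std = math.sqrt(self.gap_m2 / n_gaps) if n_gaps > 0 else 0.0
        lifetime = (self.last_ts - self.first_ts) if self.total_count else 0.0
        return {
            "total_count": self.total_count,    # long-range
            "lifetime": lifetime,               # long-range
            "gap_mean": self.gap_mean,          # long-range
            "gap_std": gap_std,                 # long-range
            "recent_count": len(self.recent),   # fixed-window only
        }


if __name__ == "__main__":
    tracker = IncrementalCEFeatures()
    for timestamp in (0.0, 10.0, 20.0, 5000.0):  # made-up CE timestamps
        tracker.update(timestamp)
    print(tracker.features())
```

In this toy trace, a purely window-based feature only sees the latest CE, while the incrementally maintained statistics still reflect the earlier burst. A feature dictionary of this kind could then be passed to any standard classifier.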