Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes

Improvements in main memory storage density are primarily driven by process technology scaling, which negatively impacts reliability by exacerbating various circuit-level error mechanisms. To compensate for growing error rates, both memory manufacturers and consumers use error-mitigation mechanisms that improve manufacturing yield and allow system designers to meet reliability targets. Developing effective error mitigations requires understanding the errors' characteristics (e.g., worst-case behavior, statistical properties). Unfortunately, we observe that proprietary on-die Error-Correcting Codes (ECC) used in modern memory chips introduce new challenges to efficient error mitigation by obfuscating CPU-visible error characteristics in an unpredictable, ECC-dependent manner. This dissertation builds a detailed understanding of how on-die ECC obfuscates the statistical properties of main memory error mechanisms using a combination of real-chip experiments and statistical analyses. We experimentally study memory errors, examine how on-die ECC obfuscates their statistical characteristics, and develop new testing techniques to overcome the obfuscation. Our results show that the obfuscated error characteristics can be recovered using new memory testing techniques that exploit the interaction between on-die ECC and the statistical characteristics of memory error mechanisms to expose physical cell behavior. We conclude by discussing the critical need for transparency in DRAM reliability characteristics in order to enable DRAM consumers to better understand and adapt commodity DRAM chips to their system-specific needs. We hope and believe that the analysis, techniques, and results we present in this dissertation will enable the community to better understand and tackle current and future reliability challenges as well as adapt commodity memory to new advantageous applications.
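To make the obfuscation effect concrete, the following is a minimal sketch, not the dissertation's method. It assumes, purely for illustration, that the on-die ECC is a textbook (7,4) single-error-correcting Hamming code with parity bits at positions 1, 2, and 4; real on-die ECC is proprietary, and its code and layout are not stated here. The function names (`syndrome`, `decode`, `cpu_visible_errors`) are hypothetical.

```python
# Illustrative assumption: a (7,4) Hamming code whose parity-check column for
# bit position p (1-based) is the binary representation of p. Positions 1, 2,
# and 4 hold parity bits; positions 3, 5, 6, and 7 hold data bits.
DATA_POSITIONS = (3, 5, 6, 7)


def syndrome(word):
    """XOR together the 1-based positions of all set bits in a 7-bit word."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def decode(word):
    """Single-error correction: flip the bit the syndrome points to, if any."""
    corrected = list(word)
    s = syndrome(corrected)
    if s:
        corrected[s - 1] ^= 1
    return corrected


def cpu_visible_errors(raw_error_positions):
    """Return the data-bit positions that are wrong after on-die correction.

    The stored word is the all-zero codeword (valid for any linear code), so
    every 1 left after decoding is a post-correction error.
    """
    word = [0] * 7
    for pos in raw_error_positions:
        word[pos - 1] ^= 1
    decoded = decode(word)
    return [p for p in DATA_POSITIONS if decoded[p - 1]]


if __name__ == "__main__":
    print(cpu_visible_errors([5]))     # [] : a single raw error is corrected
    print(cpu_visible_errors([1, 2]))  # [3] : two failed parity cells corrupt data bit 3
    print(cpu_visible_errors([5, 6]))  # [3, 5, 6] : miscorrection adds an error at bit 3
```

Under these assumptions, a single raw error disappears from the CPU's view, while two raw errors trigger a miscorrection that flips a third bit whose position depends entirely on the parity-check matrix. The errors the CPU observes therefore no longer follow the statistics of the failing cells themselves, which is the kind of ECC-dependent obfuscation described above.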
