Soft errors induced by radiation are among the most challenging issues affecting the reliability of electronic systems. In the embedded domain, system designers often rely on Double Modular Redundancy (DMR) to reach the desired level of reliability. The overhead of this solution keeps growing as modern microprocessors become more complex in all domains. This paper addresses the promising field of using efficient machine-learning-powered cores to detect soft errors. To create such cores and make them general enough to work with different software applications, microarchitectural attributes are attractive candidates as fault detection features. Several processors already track these features through dedicated Performance Monitoring Units. However, to what extent they suffice to detect faulty executions remains an open question. This paper takes a step forward in this direction. Exploiting the capability of \textit{gem5} to simulate real computing systems, perform fault injection experiments, and profile microarchitectural attributes (i.e., \textit{gem5} Stats), this paper presents the results of a comprehensive analysis of the attributes that could be used for soft error detection and of the models that can be trained with these features. Particular emphasis is devoted to understanding whether event timing brings additional information to the detection task. This information is crucial for identifying the best machine learning models to employ in this challenging task.