Radiation-induced soft errors are one of the most challenging reliability issues in Safety-Critical Real-Time Embedded Systems (SACRES) and are usually handled with different flavors of Double Modular Redundancy (DMR). This solution is becoming unaffordable in all domains because of the complexity of modern microprocessors. This paper addresses the promising field of Artificial Intelligence (AI) based hardware detectors for soft errors. To build such detector cores and make them general enough to work with different software applications, microarchitectural attributes are attractive candidates as fault detection features. Several processors already track these attributes through a dedicated Performance Monitoring Unit (PMU). However, it remains an open question to what extent they suffice to detect faulty executions. Exploiting the capability of gem5 to simulate real computing systems, perform fault injection experiments, and profile microarchitectural attributes (i.e., gem5 Stats), this paper presents a comprehensive analysis of the attributes that can potentially detect soft errors and of the models that can be trained on them.
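The idea of using microarchitectural counters as fault detection features can be illustrated with a minimal sketch. It is not the model evaluated in the paper: it swaps in a simple per-counter z-score check against fault-free ("golden") runs. The counter names, the `fit_baseline` and `is_suspect` helpers, the threshold, and all numeric values are hypothetical and chosen only for illustration; they are not gem5 Stats names or measured data.

```python
from statistics import mean, pstdev

# Hypothetical counter names, chosen for illustration only.
FEATURES = ("committed_insts", "branch_mispredicts", "dcache_misses")


def fit_baseline(golden_runs):
    """Return per-counter (mean, standard deviation) over fault-free runs."""
    baseline = {}
    for name in FEATURES:
        values = [run[name] for run in golden_runs]
        baseline[name] = (mean(values), pstdev(values))
    return baseline


def is_suspect(run, baseline, threshold=3.0):
    """Flag a run when any counter deviates by more than `threshold` sigmas."""
    for name, (mu, sigma) in baseline.items():
        deviation = abs(run[name] - mu)
        if sigma == 0:
            # A counter that never varied in golden runs: any change is suspect.
            if deviation > 0:
                return True
            continue
        if deviation / sigma > threshold:
            return True
    return False


# Made-up counter values standing in for profiled executions.
golden_runs = [
    {"committed_insts": 1000, "branch_mispredicts": 20, "dcache_misses": 50},
    {"committed_insts": 1002, "branch_mispredicts": 22, "dcache_misses": 48},
    {"committed_insts": 998, "branch_mispredicts": 18, "dcache_misses": 52},
]
baseline = fit_baseline(golden_runs)
```

A run whose counters match the golden profile passes the check, while one with a large shift in a single counter is flagged. A trained classifier would replace the fixed threshold with a decision learned from labeled fault injection results.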