Hardware reliability is adversely affected by the downscaling of semiconductor devices and by the scale-out of systems that modern applications require. Apart from crashes, this unreliability often manifests as silent data corruptions (SDCs), which affect application output. We therefore need low-cost, low-human-effort solutions that reduce both the incidence of SDCs and their effect on the quality of application output. We propose Artificial Neural Networks (ANNs) as an effective mechanism for online error detection, and we train the ANNs using software fault injection. The average overhead of our approach, detection followed by costly error correction through re-execution, is 6.45% in CPU cycles. The ANNs detect 94.85% of faults, so output quality degrades only minimally. To validate our approach, we overclock ARM Cortex-A53 CPUs, execute benchmarks on them, and record the program outputs. ANNs prove to be an efficient error detection mechanism, better than a state-of-the-art approximate error detection mechanism (Topaz), both in terms of performance (12.81% CPU overhead) and quality of application output (94.11% detection coverage).
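As a rough illustration of the two ideas named above, software fault injection and detect-then-re-execute, the following Python sketch may help. It is not the authors' implementation. The single-bit-flip fault model, the names `flip_bit`, `run_with_detection`, and `make_faulty_kernel`, and the range-check detector (a stand-in for a trained ANN) are all assumptions made for the example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE-754 double encoding inverted.

    Assumed fault model: a single bit flip in a 64-bit floating-point result.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def run_with_detection(kernel, x, detector, max_retries=3):
    """Run kernel(x) and re-execute while the detector flags the output.

    `detector(x, y)` returns True when the output looks corrupted. In the
    abstract this role is played by a trained ANN; here it is any callable.
    Returns the accepted output and the number of re-executions.
    """
    y = kernel(x)
    retries = 0
    while detector(x, y) and retries < max_retries:
        y = kernel(x)  # correction by re-execution
        retries += 1
    return y, retries


def make_faulty_kernel(fault_bit: int):
    """Hypothetical kernel (x squared) whose first call suffers one injected bit flip."""
    calls = {"n": 0}

    def kernel(x: float) -> float:
        calls["n"] += 1
        y = x * x
        return flip_bit(y, fault_bit) if calls["n"] == 1 else y

    return kernel


def range_detector(x: float, y: float) -> bool:
    """Placeholder detector: a square must lie in [0, 1e6] for the inputs used here."""
    return not (0.0 <= y <= 1e6)


if __name__ == "__main__":
    kernel = make_faulty_kernel(fault_bit=63)  # bit 63 is the sign bit
    print(run_with_detection(kernel, 3.0, range_detector))
```

In this sketch the detector runs on every output, while re-execution happens only when an output is flagged. A trained detector would replace `range_detector`, and its training data could be produced by running a kernel with and without `flip_bit` applied and labelling the outputs accordingly.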