Deep Learning (DL) systems have proliferated in many applications, requiring specialized hardware accelerators and chips. In the nano-era, devices have become increasingly susceptible to permanent and transient faults. We therefore need an efficient methodology for analyzing the resilience of advanced DL systems against such faults, and for understanding how faults in neural accelerator chips manifest as errors at the DL application level, where they can lead to undetectable and unrecoverable errors. Fault injection lets us investigate the resilience of a DL system by modifying neuron weights and outputs at the software level, as if the hardware had been affected by a transient fault. Existing fault models reduce the search space and thus speed up the analysis, but they require a priori knowledge of the model and do not allow further analysis of the filtered-out part of the search space. We therefore propose ISimDL, a novel methodology that uses neuron sensitivity to generate fault scenarios based on importance sampling. Without any a priori knowledge of the model under test, ISimDL reduces the search space as much as existing works do, while still allowing long simulations to cover all possible faults, thereby relaxing the model requirements of existing approaches. Our experiments show that importance sampling provides up to 15x higher precision in selecting critical faults than random uniform sampling, reaching this precision with fewer than 100 faults. Additionally, we showcase another practical use case of importance sampling for reliable DNN design, namely Fault Aware Training (FAT). By using ISimDL to select the faults that lead to errors, we can insert those faults during the DNN training process to harden the DNN against them. Using importance sampling in FAT reduces the overhead of finding faults that lead to a predetermined drop in accuracy by more than 12x.
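The idea of sensitivity-driven importance sampling for software-level fault injection can be illustrated with a minimal sketch. This is not the ISimDL implementation: the function names, the use of a generic non-negative per-neuron score as "sensitivity", the uniform choice of bit position, and the float32 single-bit-flip fault model are all assumptions made for illustration.

```python
import numpy as np

def importance_distribution(sensitivity, eps=1e-12):
    """Turn per-neuron sensitivity scores into sampling probabilities.

    `sensitivity` is a hypothetical score array (for example a gradient
    magnitude); eps keeps every neuron reachable, so no fault is filtered out.
    """
    s = np.abs(np.asarray(sensitivity, dtype=np.float64)).ravel() + eps
    return s / s.sum()

def sample_fault_sites(sensitivity, n_faults, rng):
    """Draw (flat neuron index, bit position) pairs.

    Neurons are drawn proportionally to sensitivity; the bit position is
    drawn uniformly over a 32-bit word (an assumption of this sketch).
    """
    p = importance_distribution(sensitivity)
    neurons = rng.choice(p.size, size=n_faults, replace=True, p=p)
    bits = rng.integers(0, 32, size=n_faults)
    return list(zip(neurons.tolist(), bits.tolist()))

def flip_bit(values, index, bit):
    """Return a float32 copy of `values` with one bit flipped at a flat index."""
    out = np.array(values, dtype=np.float32, copy=True)
    raw = out.reshape(-1).view(np.uint32)  # reinterpret the same memory as integers
    raw[index] ^= np.uint32(1) << np.uint32(bit)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.standard_normal(8).astype(np.float32)
    scores = np.abs(weights)  # placeholder sensitivity, not the paper's metric
    for neuron, bit in sample_fault_sites(scores, n_faults=3, rng=rng):
        faulty = flip_bit(weights, neuron, bit)
        print(neuron, bit, weights[neuron], faulty[neuron])
```

Because every neuron keeps a non-zero probability, a long enough campaign can still reach any fault site, while short campaigns concentrate on the sites scored as most sensitive. In a real setup the faulty weights or outputs would be fed through the model to check whether the prediction changes.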