Artificial intelligence (AI) and machine learning (ML) are becoming pervasive in today's applications, such as autonomous vehicles, healthcare, aerospace, cybersecurity, and many other critical domains. Ensuring the reliability and robustness of the underlying AI/ML hardware is therefore of paramount importance. In this paper, we explore and evaluate the reliability of different AI/ML hardware. First, we outline the reliability issues in a commercial systolic array-based ML accelerator in the presence of faults arising from device-level non-idealities in the DRAM. Next, we quantify the impact on AI/ML accuracy of circuit-level faults in the MSB and LSB logic cones of the accelerator's Multiply and Accumulate (MAC) block. Finally, we present two key reliability issues in emerging neuromorphic hardware platforms, circuit aging and endurance, and describe our system-level approach to mitigating them.