With the rapid advancement of deep learning in recent years, hardware accelerators are increasingly deployed in safety-critical applications such as autonomous driving and robotics. Although these accelerators are usually fabricated in advanced technology nodes for high performance and energy efficiency, they are also more prone to timing errors under process, voltage, temperature, and aging (PVTA) variations. By revisiting the physical sources of timing errors, we show that most timing errors in the accelerator are caused by a specific subset of input patterns, which we define as critical input patterns. To improve the accelerator's timing error resilience, we propose READ, a reliability-enhanced accelerator dataflow optimization technique that effectively reduces timing errors. READ reduces the occurrence of critical input patterns by exploring the optimal computing sequence when mapping a trained deep neural network to an accelerator. Because READ only changes the order of multiply-accumulate operations in a convolution, it introduces negligible hardware overhead and has no impact on accuracy. Experimental results on VGG and ResNet demonstrate a 7.8X reduction in timing error rate (TER) on average and up to a 37.9X reduction for certain layers. The results also show that READ enables the accelerator to maintain accuracy over a wide range of PVTA variations, making it a promising approach for robust deep-learning design.
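The claim that reordering multiply-accumulate (MAC) operations leaves accuracy unchanged can be illustrated with a small sketch. It assumes integer (quantized) arithmetic, where addition is exactly associative and commutative. The names `mac_partial_sums` and `sign_flips`, the sample data, and the "positive weights first" ordering are hypothetical illustrations. In particular, counting accumulator sign flips is only a stand-in proxy for how the sequence of intermediate values depends on the order; it is not the abstract's definition of a critical input pattern, and the ordering shown is not claimed to be READ's search procedure.

```python
from typing import List, Sequence


def mac_partial_sums(inputs: Sequence[int],
                     weights: Sequence[int],
                     order: Sequence[int]) -> List[int]:
    """Return the accumulator value after each MAC, issued in `order`."""
    acc = 0
    partial_sums = []
    for i in order:
        acc += inputs[i] * weights[i]  # one multiply-accumulate step
        partial_sums.append(acc)
    return partial_sums


def sign_flips(partial_sums: Sequence[int]) -> int:
    """Count sign changes of the accumulator (hypothetical proxy metric)."""
    flips = 0
    prev = 0  # last non-zero accumulator value seen
    for s in partial_sums:
        if prev != 0 and s != 0 and (prev < 0) != (s < 0):
            flips += 1
        if s != 0:
            prev = s
    return flips


# Illustrative data: non-negative activations, mixed-sign weights.
inputs = [3, 1, 4, 1, 5, 9]
weights = [2, -7, 1, -8, 2, -8]

baseline_order = list(range(len(weights)))
# Hypothetical reordering: issue MACs with non-negative weights first.
# sorted() is stable, so the original order is kept within each group.
reordered = sorted(baseline_order, key=lambda i: weights[i] < 0)

baseline_sums = mac_partial_sums(inputs, weights, baseline_order)
reordered_sums = mac_partial_sums(inputs, weights, reordered)

print("final result (baseline): ", baseline_sums[-1])
print("final result (reordered):", reordered_sums[-1])
print("sign flips (baseline):   ", sign_flips(baseline_sums))
print("sign flips (reordered):  ", sign_flips(reordered_sums))
```

Both orders produce the same final dot product, while the intermediate accumulator values differ. With floating-point arithmetic the final value could differ slightly due to rounding, which is why the sketch sticks to integers.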