There has been rapid progress in custom hardware (HW) for accelerating the inference of deep neural networks (DNNs). Previously, the softmax layer was not a major concern for DNN acceleration HW, because its share of the total computation is relatively small in multi-layer perceptrons and convolutional neural networks. However, as attention mechanisms are now widely used in modern DNNs, a cost-efficient implementation of the softmax layer has become very important. In this paper, we propose two methods for approximating the softmax computation, both based on LookUp Tables (LUTs). The required LUT size is quite small (about 700 Bytes), because the ranges of the softmax numerators and denominators are stable when normalization is applied to the input. We have validated the proposed technique across different AI tasks (object detection, machine translation, sentiment analysis, and semantic equivalence) and DNN models (DETR, Transformer, BERT) on a variety of benchmarks (COCO17, WMT14, WMT17, GLUE). We show that the 8-bit approximation achieves an acceptable accuracy loss, below $1.0\%$.
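To make the idea concrete, below is a minimal NumPy sketch of the general approach the abstract describes: normalize the softmax input by subtracting its maximum so the numerator range is bounded, then replace the exponential with a small lookup table. The table size, clipping range, and quantization scheme here are illustrative assumptions, not the exact configuration proposed in the paper.

```python
import numpy as np

# Illustrative LUT-based softmax approximation (parameters are assumptions,
# not the paper's exact design).
EXP_LUT_BITS = 8          # 256-entry exp table (assumed size)
EXP_INPUT_RANGE = 8.0     # clip normalized inputs to [-8, 0] (assumed range)

# Precomputed table of exp() at uniformly spaced points in [-EXP_INPUT_RANGE, 0].
_exp_lut = np.exp(np.linspace(-EXP_INPUT_RANGE, 0.0, 2 ** EXP_LUT_BITS))

def lut_softmax(x: np.ndarray) -> np.ndarray:
    """Approximate softmax along the last axis using a small exp LUT."""
    # Normalization: subtracting the max bounds the numerator in (0, 1].
    z = x - x.max(axis=-1, keepdims=True)
    z = np.clip(z, -EXP_INPUT_RANGE, 0.0)
    # Quantize the normalized input into a LUT index.
    idx = np.round((z + EXP_INPUT_RANGE) / EXP_INPUT_RANGE
                   * (2 ** EXP_LUT_BITS - 1)).astype(np.int32)
    num = _exp_lut[idx]                            # LUT-based exponential
    return num / num.sum(axis=-1, keepdims=True)   # denominator from LUT sums
```

Because the normalized inputs always lie in a fixed negative range, both the numerator entries and the resulting denominator stay within stable bounds, which is what allows such a small table to suffice.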