Deep Neural Networks (DNNs) have achieved tremendous success in cognitive applications. The core operation in a DNN is the dot product between quantized inputs and weights. Prior works exploit the weight/input repetition that arises due to quantization to avoid redundant computations in Convolutional Neural Networks (CNNs). However, in this paper we show that their effectiveness is severely limited when applied to Fully-Connected (FC) layers, which are commonly used in state-of-the-art DNNs, as is the case for modern Recurrent Neural Networks (RNNs) and Transformer models. To improve the energy efficiency of FC computation, we present CREW, a hardware accelerator that implements Computation Reuse and an Efficient Weight Storage mechanism to exploit the large number of repeated weights in FC layers. CREW first performs the multiplications of the unique weights by their respective inputs and stores the results in an on-chip buffer. The storage requirements are modest due to the small number of unique weights and the relatively small size of the inputs compared to convolutional layers. Next, CREW computes each output by fetching and adding its required products. To this end, each weight is replaced offline by an index into the buffer of unique products. Indices are typically narrower than the quantized weights, since the number of unique weights per input tends to be much lower than the range of quantized weights, which reduces storage and memory bandwidth requirements. Overall, CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage. We evaluate CREW on a diverse set of modern DNNs. On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator. Compared to UCNN, a state-of-the-art computation reuse technique, CREW achieves 2.10x speedup and 2.08x energy savings on average.
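To make the mechanism concrete, the following is a minimal NumPy sketch of the offline factorization and the computation-reuse scheme described above. It illustrates the idea only (per-input unique-product buffers plus narrow indices), not CREW's actual hardware dataflow or storage format; all function names and the 4-bit quantization in the usage example are illustrative assumptions.

```python
import numpy as np

def factorize(W):
    """Offline step (sketch): replace each quantized weight by an index
    into a per-input table of unique weight values."""
    unique_weights = []
    indices = np.empty(W.shape, dtype=np.int32)
    for j in range(W.shape[1]):  # one table per input position
        uw, idx = np.unique(W[:, j], return_inverse=True)
        unique_weights.append(uw)  # few unique values per column
        indices[:, j] = idx        # narrow index replaces the weight
    return unique_weights, indices

def crew_fc(x, unique_weights, indices):
    """Inference step (sketch): multiply each unique weight by its input
    once, buffer the products, then build outputs by indexed additions."""
    n_out, n_in = indices.shape
    # Step 1: one multiplication per unique (weight, input) pair.
    products = [uw * x[j] for j, uw in enumerate(unique_weights)]
    # Step 2: outputs are sums of buffered products; no new multiplies.
    y = np.zeros(n_out)
    for j in range(n_in):
        y += products[j][indices[:, j]]
    return y

# Usage: equivalent to a dense FC layer y = W @ x, with far fewer multiplies
# when W is quantized (here to 4 bits, so at most 16 unique values per column).
W = np.random.randint(-8, 8, size=(256, 64)).astype(np.float32)
x = np.random.rand(64).astype(np.float32)
uw, idx = factorize(W)
assert np.allclose(crew_fc(x, uw, idx), W @ x)
```

In this sketch, a 256-output column needs at most 16 multiplications per input instead of 256, and each 4-bit index is at most as wide as the quantized weight it replaces, which is the source of the storage and bandwidth savings claimed above.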