Recurrent neural networks (RNNs) are the backbone of many text and speech applications. These architectures are typically made up of several computationally complex components such as non-linear activation functions, normalization, bi-directional dependence, and attention. To maintain good accuracy, these components are frequently run using full-precision floating-point computation, making them slow, inefficient, and difficult to deploy on edge devices. In addition, the complex nature of these operations makes them challenging to quantize using standard quantization methods without a significant performance drop. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions, allowing it to serve a wide range of state-of-the-art RNNs. The proposed method enables RNN-based language models to run on edge devices with a $2\times$ improvement in runtime and a $4\times$ reduction in model size while maintaining accuracy similar to that of their full-precision counterparts.
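To make the PWL idea concrete, below is a minimal NumPy sketch of a piecewise linear approximation of tanh using uniformly spaced breakpoints; the adaptive breakpoint placement and the integer-only arithmetic described above are not shown, and the function name and parameters (`pwl_tanh`, `num_pieces`, `x_min`, `x_max`) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def pwl_tanh(x, num_pieces=8, x_min=-4.0, x_max=4.0):
    """Approximate tanh with a piecewise linear function.

    Breakpoints are placed uniformly over [x_min, x_max]; inputs outside
    that range saturate to the boundary values. An adaptive placement of
    breakpoints (as in the paper) would replace the uniform grid used
    here purely for illustration.
    """
    # Uniform breakpoints and the exact activation values at them.
    knots = np.linspace(x_min, x_max, num_pieces + 1)
    values = np.tanh(knots)
    # Clip to the covered range, then interpolate linearly between knots.
    x_clipped = np.clip(x, x_min, x_max)
    return np.interp(x_clipped, knots, values)

x = np.linspace(-6, 6, 7)
print(pwl_tanh(x))   # piecewise linear approximation
print(np.tanh(x))    # reference full-precision values
```

Because each piece is a simple affine map, the approximation can later be evaluated with fixed-point multiplies and adds, which is what makes PWL activations attractive for integer-only inference.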