Non-linear operations such as GELU, layer normalization, and softmax are essential yet costly building blocks of Transformer models. Several prior works have simplified these operations with look-up tables or integer computations, but such approximations suffer from inferior accuracy or incur considerable hardware cost and long latency. This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. Our framework employs a simple neural network as a universal approximator, whose structure is then equivalently transformed into a look-up table (LUT). The proposed framework, called NN-LUT, can accurately replace all the non-linear operations in popular BERT models while significantly reducing area, power consumption, and latency.
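To illustrate the idea of the network-to-LUT transformation, the sketch below shows how a single-hidden-layer ReLU network, which is piecewise linear by construction, can be converted into an equivalent table of breakpoints, slopes, and intercepts. This is a minimal illustration only: the weights are hypothetical placeholders rather than trained parameters, and the helper names (`nn_eval`, `lut_eval`) are not from the paper.

```python
import numpy as np

# Hypothetical parameters of a tiny 1-hidden-layer ReLU network y = v . ReLU(w*x + b) + c,
# assumed to approximate a non-linearity such as GELU; values are placeholders, not trained weights.
w = np.array([1.0, 1.0, 1.0, 1.0])    # hidden-layer weights
b = np.array([-2.0, -0.5, 0.5, 2.0])  # hidden-layer biases
v = np.array([0.3, 0.4, 0.2, 0.1])    # output-layer weights
c = 0.0                               # output bias

def nn_eval(x):
    """Evaluate the ReLU network directly (piecewise linear in x)."""
    return v @ np.maximum(w[:, None] * x + b[:, None], 0.0) + c

# Equivalent LUT: breakpoints are where each ReLU switches on/off; between breakpoints
# the network is exactly linear, so each segment reduces to one slope/intercept entry.
breakpoints = np.sort(-b / w)
mids = np.concatenate(([breakpoints[0] - 1.0],
                       (breakpoints[:-1] + breakpoints[1:]) / 2,
                       [breakpoints[-1] + 1.0]))  # one sample point inside each segment
slopes = np.array([(nn_eval(np.array([m + 1e-3])) - nn_eval(np.array([m]))) / 1e-3
                   for m in mids]).ravel()
intercepts = np.array([nn_eval(np.array([m])) - s * m
                       for m, s in zip(mids, slopes)]).ravel()

def lut_eval(x):
    """Evaluate via the LUT: locate the segment, then one multiply-add."""
    seg = np.searchsorted(breakpoints, x)
    return slopes[seg] * x + intercepts[seg]

x = np.linspace(-4, 4, 9)
print(np.allclose(nn_eval(x), lut_eval(x)))  # the two forms agree up to float error
```

In this view, inference cost no longer depends on the network at all: each evaluation is a breakpoint lookup followed by a single multiply-add, which is what makes the LUT form hardware-friendly.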