Transformer-based models, such as BERT and RoBERTa, have achieved state-of-the-art results on many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even in the data center. While quantization is a viable solution to this problem, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer-based models that performs the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs end-to-end integer-only BERT inference without any floating-point calculation. We evaluate our approach on the GLUE downstream tasks using RoBERTa-Base and RoBERTa-Large. We show that, in both cases, I-BERT achieves accuracy similar to (and in some cases slightly higher than) the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4x to 4.0x for INT8 inference on a T4 GPU system relative to FP32 inference. The framework has been developed in PyTorch and open-sourced.
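To make the integer-only approximation idea concrete, the sketch below shows the shared kernel behind the nonlinear approximations: a second-order polynomial a(x + b)^2 + c evaluated purely with integer arithmetic on the quantized value, with all floating-point coefficients folded offline into integer offsets and an output scale. This is a minimal NumPy illustration, not code from the open-sourced I-BERT repository; the helper name int_poly and the example scale S are ours, while the coefficients a = -0.2888, b = -1.769 are the erf fit used by i-GELU in the paper. The final dequantized multiply is for inspection only; I-BERT keeps that step in integer arithmetic as well.

```python
import numpy as np

def int_poly(q, S, a, b, c):
    """Evaluate a*(x + b)**2 + c for x = S*q using only integer ops on q.

    The floating-point coefficients are folded into q_b, q_c, and the
    output scale S_out offline, so inference touches only integers.
    """
    q_b = np.int64(np.floor(b / S))            # b folded into an integer offset
    q_c = np.int64(np.floor(c / (a * S * S)))  # c folded likewise
    q_out = (q + q_b) ** 2 + q_c               # integer-only arithmetic
    S_out = a * S * S                          # output scale, known offline
    return q_out, S_out

# Example: integer-only erf(x / sqrt(2)), the core of the i-GELU approximation.
S = 0.01                                       # assumed input quantization scale
q = np.arange(-300, 301, dtype=np.int64)       # quantized inputs, x = S * q
S_erf = S / np.sqrt(2.0)                       # feeds x / sqrt(2) to the polynomial
q_sgn = np.sign(q)                             # polynomial fits the positive branch
q_abs = np.minimum(np.abs(q), int(np.floor(1.769 / S_erf)))  # clip |x/sqrt(2)| at -b
q_L, S_L = int_poly(q_abs, S_erf, a=-0.2888, b=-1.769, c=1.0)
erf_approx = q_sgn * q_L * S_L                 # approx. erf(x / sqrt(2))
gelu_approx = (S * q) * 0.5 * (1.0 + erf_approx)  # dequantized here only to inspect
```

Under these assumptions, the same kernel style extends to Softmax (a polynomial fit of exp on a bounded range) and Layer Normalization (an integer Newton iteration for the square root), since in each case the floating-point constants can be precomputed into the scales.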