Transformer-based models, such as BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even at the data center. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer-based models that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs end-to-end integer-only BERT inference without any floating-point calculation. We evaluate our approach on the GLUE downstream tasks using RoBERTa-Base/Large. We show that in both cases, I-BERT achieves similar (and slightly higher) accuracy compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0x for INT8 inference on a T4 GPU system compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced.
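To make the idea of an integer-only approximation of a nonlinear operation concrete, the sketch below evaluates a second-order polynomial on quantized values so that all per-element arithmetic at inference time is integer; the function name, the NumPy setup, and the coefficients are illustrative assumptions and not the exact approximation used in I-BERT.

```python
import numpy as np

def int_poly2(q, S, a, b, c):
    """Integer-only evaluation of a*(x + b)**2 + c for x = S * q.

    q : int32 array of quantized inputs; S : floating-point scaling factor.
    The constants q_b, q_c, and S_out can be precomputed offline, so the
    per-element runtime arithmetic involves only integers.
    (Illustrative sketch; I-BERT's actual coefficient handling may differ.)
    """
    q_b = np.floor(b / S).astype(np.int32)            # precomputed offline
    q_c = np.floor(c / (a * S * S)).astype(np.int32)  # precomputed offline
    q_out = (q + q_b) ** 2 + q_c                      # integer-only at runtime
    S_out = a * S * S                                 # output scaling factor
    return q_out, S_out

if __name__ == "__main__":
    S = 0.02                                          # assumed input scale
    q = np.array([-100, -10, 0, 10, 100], dtype=np.int32)
    # Generic placeholder coefficients for a bounded-range quadratic fit;
    # a GELU-style curve would be fit over its useful input range instead.
    q_out, S_out = int_poly2(q, S, a=0.5, b=1.0, c=0.25)
    print(q_out * S_out)                              # dequantized result
```

A similar construction is what allows nonlinearities such as GELU, Softmax, and Layer Normalization to be approximated without floating-point kernels, since the only floating-point quantities are scaling factors folded into offline-computed constants.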