Transformer language models such as GPT-2 are difficult to quantize because outliers in their activations cause large quantization error. Adapting to this error requires quantization-aware training, a fine-tuning process that relies on the same dataset and training pipeline as the original model. Pretrained language models, however, often do not grant access to their datasets and training pipelines, forcing us to fine-tune on arbitrary ones. In that case, quantization-aware training is observed to overfit the model to the fine-tuning data. To quantize without overfitting, we introduce a quantization adapter (Quadapter), a small set of parameters learned to make activations quantization-friendly by scaling them channel-wise, while keeping the model parameters unchanged. By applying our method to the challenging task of quantizing GPT-2, we demonstrate that it effectively prevents overfitting and improves quantization performance.
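The core idea described above is a learnable per-channel scale applied to activations, with its inverse folded back so the full-precision function of the network is preserved. Below is a minimal PyTorch sketch of such a channel-wise scaling adapter; the class name, the `quantize` callable, and the exact placement of the scale and its inverse are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Quadapter(nn.Module):
    """Illustrative channel-wise scaling adapter (sketch, not the authors' code).

    A learnable per-channel scale is applied before quantization and undone
    afterward, so the full-precision mapping is unchanged while activations
    become easier to quantize. The model weights stay frozen; only the scale
    parameters are trained.
    """
    def __init__(self, num_channels: int):
        super().__init__()
        # Initialized to 1 so the adapter starts as an identity mapping.
        self.scale = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor, quantize) -> torch.Tensor:
        # x: (..., num_channels); `quantize` is any fake-quantization function.
        x_scaled = x / self.scale      # shrink outlier channels
        x_q = quantize(x_scaled)       # quantize the friendlier activations
        return x_q * self.scale        # undo the scaling afterward
```

In practice, the inverse scaling can be absorbed into the weights of an adjacent linear layer, so the adapter adds only a per-channel multiplication at inference time.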