Extreme compression, particularly ultra-low-bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage, expensive knowledge distillation with extensive hyperparameter tuning. Moreover, they tend to pay less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and they lack a systematic study demonstrating the effectiveness of their methods. In this paper, we perform a comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low-bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; and (2) extreme quantization plus layer reduction can reduce the model size by 50x, yielding new state-of-the-art results on GLUE tasks.
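To make the "ultra-low-bit precision" idea concrete, the sketch below shows one common form of ternary weight quantization (a TWN-style threshold-and-scale scheme): each weight is mapped to {-alpha, 0, +alpha}. This is a minimal illustration of the general technique the abstract refers to, not XTC's actual quantizer; the 0.7 threshold factor and the function name are illustrative assumptions.

```python
import numpy as np

def ternarize(w: np.ndarray) -> np.ndarray:
    """Illustrative ternary quantization: map weights to {-alpha, 0, +alpha}.

    Uses a TWN-style heuristic: weights with magnitude below a threshold
    (0.7 * mean absolute weight) are zeroed; the rest share a single
    scaling factor alpha chosen as the mean magnitude of the kept weights.
    """
    delta = 0.7 * np.mean(np.abs(w))      # magnitude threshold (heuristic)
    mask = np.abs(w) > delta              # which weights stay non-zero
    if not mask.any():
        return np.zeros_like(w)
    alpha = np.mean(np.abs(w[mask]))      # shared scale for kept weights
    return alpha * np.sign(w) * mask
```

Storing only the sign pattern plus one scale per tensor is what enables the large (e.g., ~16x for ternary vs. FP32 weights) storage reduction that, combined with layer reduction, the abstract's 50x figure builds on.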