Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are the main directions in model compression. However, compact models obtained through knowledge distillation may suffer from a significant accuracy drop even at a relatively small compression ratio. On the other hand, only a few quantization attempts have been specifically designed for natural language processing tasks, and they suffer from small compression ratios or large error rates because hyper-parameters must be set manually and fine-grained subgroup-wise quantization is not supported. In this paper, we propose an automatic mixed-precision quantization framework for BERT that conducts quantization and pruning simultaneously at the subgroup level. Specifically, our method leverages Differentiable Neural Architecture Search to automatically assign a scale and precision to the parameters of each subgroup, while at the same time pruning redundant groups of parameters. Extensive evaluations on BERT downstream tasks reveal that our method outperforms the baselines, matching their performance with a much smaller model size. We also show the feasibility of obtaining an extremely lightweight model by combining our solution with orthogonal methods such as DistilBERT.
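To make the search idea concrete, below is a minimal PyTorch sketch of DNAS-style subgroup-wise mixed-precision quantization. It is not the paper's implementation: the class, the candidate bit-widths, and all parameter names are illustrative assumptions. Each subgroup holds architecture logits over candidate precisions (including a 0-bit option, which amounts to pruning the subgroup) and a learnable quantization scale; the forward pass is a softmax-weighted mixture of the quantized candidates, so both the precision choice and the scale are trained by gradient descent.

```python
# Minimal sketch (not the authors' code) of DNAS-based subgroup-wise
# mixed-precision quantization with a 0-bit pruning candidate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubgroupMixedPrecisionQuantizer(nn.Module):
    def __init__(self, num_subgroups, candidate_bits=(0, 2, 4, 8)):
        super().__init__()
        self.candidate_bits = candidate_bits
        # One architecture logit per (subgroup, bit-width) candidate.
        self.alpha = nn.Parameter(torch.zeros(num_subgroups, len(candidate_bits)))
        # One learnable quantization scale per subgroup (stored in log space).
        self.log_scale = nn.Parameter(torch.zeros(num_subgroups))

    def quantize(self, w, bits, scale):
        if bits == 0:
            # 0-bit candidate: the whole subgroup is pruned to zero.
            return torch.zeros_like(w)
        qmax = 2 ** (bits - 1) - 1
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        # Straight-through estimator: quantized forward, identity backward.
        return w + (q * scale - w).detach()

    def forward(self, weight):
        # weight: (num_subgroups, group_size), weights pre-split into subgroups.
        probs = F.softmax(self.alpha, dim=-1)   # differentiable choice per subgroup
        scale = self.log_scale.exp().unsqueeze(-1)
        out = torch.zeros_like(weight)
        for i, bits in enumerate(self.candidate_bits):
            out = out + probs[:, i:i + 1] * self.quantize(weight, bits, scale)
        return out
```

After the search converges, one would take the argmax of `alpha` per subgroup to commit each subgroup to a single precision (or to pruning), which is the standard way a DNAS relaxation is discretized into a final architecture.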