Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are the main directions in model compression. However, compact models obtained through knowledge distillation may suffer from a significant accuracy drop even at a relatively small compression ratio. On the other hand, only a few quantization attempts have been specifically designed for natural language processing tasks, and they suffer from small compression ratios or large error rates because hyper-parameters must be set manually and fine-grained subgroup-wise quantization is not supported. In this paper, we propose an automatic mixed-precision quantization framework for BERT that conducts quantization and pruning simultaneously at the subgroup level. Specifically, our method leverages Differentiable Neural Architecture Search to automatically assign a scale and precision to the parameters of each subgroup, while at the same time pruning redundant groups of parameters. Extensive evaluations on BERT downstream tasks reveal that our method outperforms the baselines, matching their performance with a much smaller model size. We also show the feasibility of obtaining an extremely lightweight model by combining our solution with orthogonal methods such as DistilBERT.
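To make the search idea concrete, below is a minimal PyTorch sketch of DNAS-style subgroup-wise mixed-precision quantization. It is not the paper's implementation: the class, the candidate bit-widths, and all parameter names are illustrative assumptions. Each subgroup holds architecture logits over candidate precisions (including a 0-bit option, which amounts to pruning the subgroup) and a learnable quantization scale; the forward pass is a softmax-weighted mixture of the quantized candidates, so both the precision choice and the scale are trained by gradient descent.

```python
# Minimal sketch (not the authors' code) of DNAS-based subgroup-wise
# mixed-precision quantization with a 0-bit pruning candidate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubgroupMixedPrecisionQuantizer(nn.Module):
    def __init__(self, num_subgroups, candidate_bits=(0, 2, 4, 8)):
        super().__init__()
        self.candidate_bits = candidate_bits
        # One architecture logit per (subgroup, bit-width) candidate.
        self.alpha = nn.Parameter(torch.zeros(num_subgroups, len(candidate_bits)))
        # One learnable quantization scale per subgroup (stored in log space).
        self.log_scale = nn.Parameter(torch.zeros(num_subgroups))

    def quantize(self, w, bits, scale):
        if bits == 0:
            # 0-bit candidate: the whole subgroup is pruned to zero.
            return torch.zeros_like(w)
        qmax = 2 ** (bits - 1) - 1
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        # Straight-through estimator: quantized forward, identity backward.
        return w + (q * scale - w).detach()

    def forward(self, weight):
        # weight: (num_subgroups, group_size), weights pre-split into subgroups.
        probs = F.softmax(self.alpha, dim=-1)   # differentiable choice per subgroup
        scale = self.log_scale.exp().unsqueeze(-1)
        out = torch.zeros_like(weight)
        for i, bits in enumerate(self.candidate_bits):
            out = out + probs[:, i:i + 1] * self.quantize(weight, bits, scale)
        return out
```

After the search converges, one would take the argmax of `alpha` per subgroup to commit each subgroup to a single precision (or to pruning), which is the standard way a DNAS relaxation is discretized into a final architecture.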