TiltedBERT: Resource Adjustable Version of BERT

In this paper, we propose a novel adjustable fine-tuning method that reduces the training and inference time of the BERT model on downstream tasks. The method first identifies the more important word vectors in each layer using our proposed redundancy metric, and then eliminates the less important word vectors using our proposed strategy. The word vector elimination rate in each layer is controlled by the Tilt-Rate hyper-parameter, and the model learns to work with considerably fewer floating point operations (FLOPs) than the original BERTbase model. The method needs no extra training steps and can be generalized to other transformer-based models. Our extensive experiments show that the word vectors in higher layers carry a large amount of redundancy, which can be eliminated to decrease training and inference time. Experimental results on a wide range of sentiment analysis, classification, and regression datasets, and on benchmarks such as IMDB and GLUE, show that the method is effective across datasets. Applied to the BERTbase model, it reduces inference time by up to 5.3 times with less than 0.85% accuracy degradation on average. After the fine-tuning stage, the inference time of the model can be adjusted over a wide range of Tilt-Rate values through the method's offline-tuning property. We also propose a mathematical speedup analysis that accurately estimates the speedup of our method. With this analysis, a suitable Tilt-Rate value can be selected before fine-tuning or during offline tuning.
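To make the idea of a speedup estimate concrete, the sketch below compares the approximate cost of a standard encoder stack with that of a stack in which each layer processes fewer word vectors. It is not the paper's analysis, which the abstract does not spell out. The function names `layer_macs` and `estimated_speedup`, the per-layer counts passed in, and the assumption that a whole layer runs on the reduced number of vectors are all hypothetical and chosen only for illustration. The defaults (hidden size 768, feed-forward size 3072) are the standard BERTbase dimensions.

```python
def layer_macs(n, d=768, ffn=3072):
    """Approximate multiply-accumulates for one encoder layer on n word vectors."""
    projections = 4 * n * d * d      # Q, K, V and attention output projections
    attention = 2 * n * n * d        # score matrix plus weighted sum of values
    feed_forward = 2 * n * d * ffn   # two dense layers of the feed-forward block
    return projections + attention + feed_forward


def estimated_speedup(full_len, kept_per_layer, d=768, ffn=3072):
    """Ratio of baseline cost to reduced cost.

    kept_per_layer is a hypothetical list holding the number of word vectors
    each layer processes; how Tilt-Rate maps to these counts is NOT defined
    here and would have to come from the paper itself.
    """
    baseline = len(kept_per_layer) * layer_macs(full_len, d, ffn)
    reduced = sum(layer_macs(n, d, ffn) for n in kept_per_layer)
    return baseline / reduced


if __name__ == "__main__":
    # Illustrative counts only: 12 layers, sequence length 128, shrinking input.
    kept = [128, 112, 96, 80, 64, 56, 48, 40, 32, 24, 16, 8]
    print(f"estimated speedup: {estimated_speedup(128, kept):.2f}x")
```

Because the attention term grows quadratically with the number of word vectors, halving the count in every layer yields slightly more than a 2x reduction in this cost model. Embedding, pooling, and classifier costs are ignored.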

