We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred simultaneously. Through intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.
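The abstract's central idea, learning a binary mask over frozen pretrained weights instead of updating them, can be illustrated with a minimal sketch. The snippet below is not the authors' exact implementation; the threshold value, score initialization, and the straight-through gradient estimator are assumptions chosen to make the example runnable and to reflect how binary masks are commonly trained.

```python
# Minimal sketch (assumptions noted above): pretrained weights stay frozen,
# and only real-valued mask scores are trained per task. The scores are
# hard-thresholded to a 0/1 mask in the forward pass, with gradients passed
# through via a straight-through estimator.
import torch
import torch.nn as nn


class BinaryMaskStraightThrough(torch.autograd.Function):
    """Threshold scores to {0, 1}; pass gradients straight through."""

    @staticmethod
    def forward(ctx, scores, threshold):
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the thresholding as identity.
        return grad_output, None


class MaskedLinear(nn.Module):
    """A linear layer whose frozen pretrained weight is selectively masked."""

    def __init__(self, pretrained_linear: nn.Linear, threshold: float = 0.5):
        super().__init__()
        # Freeze the pretrained weight and bias.
        self.weight = nn.Parameter(
            pretrained_linear.weight.detach(), requires_grad=False
        )
        self.bias = (
            nn.Parameter(pretrained_linear.bias.detach(), requires_grad=False)
            if pretrained_linear.bias is not None
            else None
        )
        # Only these real-valued scores are learned for each downstream task
        # (initialization slightly above the threshold is an assumption).
        self.mask_scores = nn.Parameter(torch.full_like(self.weight, 0.51))
        self.threshold = threshold

    def forward(self, x):
        mask = BinaryMaskStraightThrough.apply(self.mask_scores, self.threshold)
        return nn.functional.linear(x, self.weight * mask, self.bias)


# Usage: wrap an existing pretrained layer; only mask_scores receive gradients.
layer = MaskedLinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
```

Because the underlying weights are shared and unchanged across tasks, serving a new task only requires storing one bit per masked weight, which is the source of the memory savings claimed when several tasks are served simultaneously.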