Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained language models (PLMs) like BERT contain matching subnetworks whose transfer learning performance is similar to that of the original PLM. These subnetworks are found using magnitude-based pruning. In this paper, we find that BERT subnetworks have even more potential than these studies have shown. First, we discover that the success of magnitude pruning can be attributed to the preserved pre-training performance, which correlates with downstream transferability. Inspired by this, we propose to directly optimize the subnetwork structure towards the pre-training objectives, which better preserves the pre-training performance. Specifically, we train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork, agnostic to any specific downstream task. We then fine-tune the subnetworks on the GLUE benchmark and the SQuAD dataset. The results show that, compared with magnitude pruning, mask training can effectively find BERT subnetworks with improved overall downstream performance. Moreover, our method is more efficient in searching for subnetworks and more advantageous when fine-tuning with limited amounts of downstream data. Our code is available at https://github.com/llyx97/TAMT.
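To make the "train binary masks over model weights" idea concrete, below is a minimal PyTorch sketch of one common way to do this: each frozen weight matrix gets a real-valued score matrix that is binarized in the forward pass, and gradients reach the scores through a straight-through estimator. This is an illustrative assumption, not the authors' implementation from the TAMT repository; the class names (MaskedLinear, Binarize), the magnitude-based score initialization, and the placeholder loss standing in for the pre-training objective (e.g. MLM) are all hypothetical.

```python
# Minimal sketch of binary mask training over frozen pre-trained weights
# (illustrative only; not the TAMT code). Only the mask scores are trained,
# against a pre-training-style objective, while the weights stay frozen.

import torch
import torch.nn as nn


class Binarize(torch.autograd.Function):
    """Threshold scores to a {0, 1} mask; pass gradients straight through."""

    @staticmethod
    def forward(ctx, scores, sparsity):
        # Keep the top-(1 - sparsity) fraction of weights by score.
        k = int((1 - sparsity) * scores.numel())
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: identity gradient w.r.t. the scores.
        return grad_output, None


class MaskedLinear(nn.Module):
    """A frozen linear layer whose weights are gated by a trainable binary mask."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach(), requires_grad=False)
        # Initialize scores from weight magnitudes (one common choice).
        self.scores = nn.Parameter(self.weight.abs().clone())
        self.sparsity = sparsity

    def forward(self, x):
        mask = Binarize.apply(self.scores, self.sparsity)
        return nn.functional.linear(x, self.weight * mask, self.bias)


if __name__ == "__main__":
    layer = MaskedLinear(nn.Linear(16, 16), sparsity=0.7)
    optimizer = torch.optim.Adam([layer.scores], lr=1e-2)
    x = torch.randn(4, 16)
    # Placeholder objective standing in for the pre-training loss (e.g. MLM).
    loss = layer(x).pow(2).mean()
    loss.backward()
    optimizer.step()
```

Because only the scores receive gradients, the discovered subnetwork structure depends solely on the (task-agnostic) pre-training objective, which is what allows the resulting mask to be reused when fine-tuning on GLUE or SQuAD.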