Figure 7 shows the accuracy curves of mask training. We find that mask training starting from the pre-trained BERT (ft step=0) converges more slowly than mask training on the fully fine-tuned BERT (ft to end), especially on the HANS OOD dataset. However, when mask training starts from a BERT checkpoint fine-tuned for 20,000 steps (roughly 55% of ft to end), its final performance shows no significant gap from ft to end. This suggests that a large part of the cost of the full fine-tune-then-prune pipeline can be saved in the BERT fine-tuning stage without hurting the final performance of the subnetwork. Note, however, that the above analysis only establishes the feasibility of reducing the number of BERT fine-tuning steps; to actually reduce the training cost, we would need to predict in advance when to start mask training, and a further investigation of this problem is a promising direction for future work.
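To make the pipeline concrete, below is a minimal sketch of the two-phase procedure implied above: fine-tune BERT for only part of the full schedule, then freeze the weights at that intermediate checkpoint and train binary weight masks from there. It assumes PyTorch and HuggingFace Transformers; the `MaskedLinear` wrapper, the `apply_weight_masks` helper, the 3-way classification head, and the early-stopping point are illustrative choices, not the exact implementation used in our experiments.

```python
# Minimal sketch of the "partial fine-tuning -> mask training" pipeline.
# Assumptions (illustrative, not the paper's released code): PyTorch +
# HuggingFace Transformers; MaskedLinear / apply_weight_masks are hypothetical
# helpers; the 3-label head and the early fine-tuning cutoff are examples only.
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification


class MaskedLinear(nn.Module):
    """Wraps a frozen nn.Linear and learns a binary weight mask with the
    straight-through estimator: hard 0/1 mask in the forward pass, sigmoid
    gradient in the backward pass."""

    def __init__(self, linear: nn.Linear, init_score: float = 2.0):
        super().__init__()
        self.linear = linear
        # Real-valued scores; sigmoid(score) > 0.5 keeps the weight.
        self.scores = nn.Parameter(torch.full_like(linear.weight, init_score))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.scores)
        hard = (probs > 0.5).float()
        mask = hard + probs - probs.detach()   # straight-through estimator
        return nn.functional.linear(x, self.linear.weight * mask, self.linear.bias)


def apply_weight_masks(model: nn.Module) -> None:
    """Replace every nn.Linear inside `model` with a MaskedLinear wrapper."""
    targets = [(parent, name, child)
               for parent in model.modules()
               for name, child in parent.named_children()
               if isinstance(child, nn.Linear)]
    for parent, name, child in targets:
        setattr(parent, name, MaskedLinear(child))


model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

# Phase 1: ordinary fine-tuning, stopped early (e.g. after roughly half of the
# full schedule instead of running it to the end).
# ... standard fine-tuning loop over the task data for `ft_steps` updates ...

# Phase 2: freeze the encoder weights at this intermediate checkpoint and
# train only the mask scores (plus the task head) from here on.
for p in model.bert.parameters():
    p.requires_grad = False
apply_weight_masks(model.bert)

mask_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(mask_params, lr=1e-4)
# ... mask-training loop: same task data, but only `mask_params` are updated ...
```

The design choice worth noting is that the original weights are kept frozen once mask training begins, so the only cost saved or added by moving the switch point earlier or later is plain fine-tuning steps, which is exactly the quantity varied in the analysis above.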