Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size and latency of these models have increased significantly, which makes them costly to use and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach that parameterizes each weight matrix using its low-rank factorization and adaptively removes rank-1 components during training. On language modeling tasks, our structured approach outperforms other unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks.
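To make the idea concrete, below is a minimal PyTorch sketch of the factorized parameterization described above: a linear layer whose weight is written as W = P diag(g) Q, with a gate per rank-1 component that a sparsity penalty can drive to zero. The class name, the simple magnitude threshold, and the plain L1-style gating are illustrative assumptions, not the paper's exact regularization or gating scheme.

```python
import torch
import torch.nn as nn

class LowRankPrunedLinear(nn.Module):
    """Illustrative sketch (not the paper's exact formulation): a linear layer
    parameterized as W = P @ diag(g) @ Q, where g gates rank-1 components.
    Components whose gate reaches zero during training can be dropped,
    permanently shrinking the factorization."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.Q = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        # One gate per rank-1 component; adding a sparsity penalty on g
        # (e.g. an L1 or relaxed-L0 term) encourages entries toward zero.
        self.g = nn.Parameter(torch.ones(rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.P @ torch.diag(self.g) @ self.Q  # (out, in)
        return x @ weight.t()

    @torch.no_grad()
    def compress(self, threshold: float = 1e-3) -> None:
        """Remove rank-1 components whose gate magnitude is ~zero."""
        keep = self.g.abs() > threshold
        self.P = nn.Parameter(self.P[:, keep].clone())
        self.Q = nn.Parameter(self.Q[keep, :].clone())
        self.g = nn.Parameter(self.g[keep].clone())
```

During training one would add a term such as `lam * layer.g.abs().sum()` to the loss; after training, calling `compress()` yields a genuinely smaller factorized layer whose matrix-multiply cost scales with the retained rank, which is what gives structured pruning its inference speedup.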