SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of training directly on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization). Scaling the model and dataset size has helped improve the performance of large language models (LLMs), but it also leads to prohibitive computational costs. Pre-training an LLM often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity usually remains the same between the two phases. To improve training efficiency with respect to training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recovering the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction for training large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.
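
The two phases described above can be illustrated with a toy sketch. This is not the authors' implementation: it uses a small linear model instead of a GPT model, assumes a random static mask (the abstract only says "unstructured weight sparsity"), and reuses the same data for both phases. The names make_mask, sgd_step, grad_mse, and mse are hypothetical helpers introduced here for illustration.

```python
import random

def make_mask(n_weights, sparsity, seed=0):
    """Random unstructured mask (assumed): 1 = trainable, 0 = held at zero."""
    rng = random.Random(seed)
    n_zero = int(round(sparsity * n_weights))
    mask = [0] * n_zero + [1] * (n_weights - n_zero)
    rng.shuffle(mask)
    return mask

def mse(weights, xs, ys):
    """Mean squared error of the linear model y = w . x."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(xs)

def grad_mse(weights, xs, ys):
    """Gradient of the mean squared error with respect to the weights."""
    n = len(xs)
    grads = [0.0] * len(weights)
    for x, y in zip(xs, ys):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grads[j] += 2.0 * err * xi / n
    return grads

def sgd_step(weights, grads, lr, mask=None):
    """One SGD step; with a mask, masked-out weights are kept at zero."""
    if mask is None:
        return [w - lr * g for w, g in zip(weights, grads)]
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

# Toy regression data generated from a dense "true" weight vector.
rng = random.Random(1)
dim = 8
true_w = [rng.uniform(-1, 1) for _ in range(dim)]
xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(64)]
ys = [sum(w * xi for w, xi in zip(true_w, x)) for x in xs]

# Sparse pre-training: 75% of the weights are zeroed and never updated.
mask = make_mask(dim, sparsity=0.75)
weights = [rng.uniform(-0.1, 0.1) * m for m in mask]
for _ in range(200):
    weights = sgd_step(weights, grad_mse(weights, xs, ys), lr=0.05, mask=mask)
sparse_weights = list(weights)

# Dense fine-tuning: drop the mask so the zeroed weights can learn.
for _ in range(200):
    weights = sgd_step(weights, grad_mse(weights, xs, ys), lr=0.05)

print("loss after sparse phase:", mse(sparse_weights, xs, ys))
print("loss after dense phase: ", mse(weights, xs, ys))
```

Note that multiplying by a mask, as done here, only simulates sparsity and does not by itself save any FLOPs. The FLOP reduction reported in the abstract presupposes hardware or kernels that can skip the zeroed weights.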
