We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly available models to show all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and downstream objectives. We describe our learnings, including how Maximal Update Parameterization ($\mu$P) can further improve large model scaling, yielding better accuracy and hyperparameter predictability at scale. We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace: https://huggingface.co/cerebras.
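For orientation, here is a minimal sketch of the compute-optimal relations this framing relies on, written with illustrative symbols ($N$ parameters, $D$ training tokens, $C$ training FLOPs) rather than fitted values from this work: the standard FLOPs approximation, the Chinchilla rule of roughly 20 tokens per parameter, and a power-law fit of loss against compute.

$$
C \;\approx\; 6ND,
\qquad
D_{\mathrm{opt}} \;\approx\; 20\,N_{\mathrm{opt}},
\qquad
L(C) \;\approx\; \left(\frac{C_c}{C}\right)^{\alpha_C},
$$

where $C_c$ and $\alpha_C$ are constants fit empirically to a family of training runs; the specific fitted values reported for Cerebras-GPT appear later in the paper, not here.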