We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-training (highest accuracy for a given compute budget). We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly available models to show all Cerebras-GPT models have state-of-the-art training efficiency on both pre-training and downstream objectives. We describe our learnings, including how Maximal Update Parameterization ($\mu$P) can further improve large model scaling, yielding better accuracy and hyperparameter predictability at scale. We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes. Cerebras-GPT models are available on HuggingFace: https://huggingface.co/cerebras.
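For orientation, here is a minimal sketch of the compute-optimal relations this framing relies on, written with illustrative symbols ($N$ parameters, $D$ training tokens, $C$ training FLOPs) rather than fitted values from this work: the standard FLOPs approximation, the Chinchilla rule of roughly 20 tokens per parameter, and a power-law fit of loss against compute.

$$
C \;\approx\; 6ND,
\qquad
D_{\mathrm{opt}} \;\approx\; 20\,N_{\mathrm{opt}},
\qquad
L(C) \;\approx\; \left(\frac{C_c}{C}\right)^{\alpha_C},
$$

where $C_c$ and $\alpha_C$ are constants fit empirically to a family of training runs; the specific fitted values reported for Cerebras-GPT appear later in the paper, not here.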