Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training large language models.
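To make the described pipeline concrete, the following is a minimal sketch of the cluster-train-ensemble idea, not the paper's implementation: it assumes a tf-idf + k-means clustering step (an illustrative choice of document representation) and a hypothetical `ExpertLM` interface with a `next_token_probs` method; the `train_expert_lm` call is likewise a placeholder.

```python
# Minimal sketch of: (1) cluster the corpus, (2) train one expert LM per cluster
# independently (embarrassingly parallel, no cross-expert synchronization),
# (3) combine the experts in a sparse ensemble at inference time.
# `train_expert_lm` and `ExpertLM.next_token_probs` are hypothetical placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def cluster_corpus(documents, num_experts, seed=0):
    """Embed documents with tf-idf and partition them with k-means."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    embeddings = vectorizer.fit_transform(documents)
    kmeans = KMeans(n_clusters=num_experts, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(embeddings)
    clusters = [[] for _ in range(num_experts)]
    for doc, label in zip(documents, labels):
        clusters[label].append(doc)
    return clusters, vectorizer, kmeans


# Each expert trains on its own cluster only, so training needs no communication:
#   experts = [train_expert_lm(cluster) for cluster in clusters]   # hypothetical


def ensemble_next_token_probs(experts, vectorizer, kmeans, context, top_k=2):
    """Sparse ensemble: weight each expert by how close the context is to its cluster."""
    x = vectorizer.transform([context])
    # Distance from the context to each cluster centroid -> softmax weights.
    distances = kmeans.transform(x)[0]
    weights = np.exp(-distances) / np.exp(-distances).sum()
    # Keep only the top-k experts (sparsity) and renormalize their weights.
    active = np.argsort(weights)[-top_k:]
    active_weights = weights[active] / weights[active].sum()
    return sum(
        w * experts[i].next_token_probs(context)  # hypothetical expert interface
        for i, w in zip(active, active_weights)
    )
```

In this sketch, only the top-k experts are queried for a given context, which is the sparsity that lets training and inference avoid the parameter synchronization required by dense models.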