We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.
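To make the branch-train-merge cycle concrete, the following is a minimal PyTorch sketch of the three steps as the abstract describes them: branching a new expert from a parameter mixture of existing ELMs, training it on its own domain data with no cross-expert communication, and merging either by parameter averaging or by ensembling output distributions. The function names (`branch`, `train_elm`, `merge_average`, `merge_ensemble`) and the `domain_posteriors` weighting are illustrative assumptions, not the paper's released implementation; the sketch assumes all ELMs share an identical architecture and return next-token logits.

```python
import copy
import torch
import torch.nn.functional as F


def branch(existing_elms, weights):
    """Branch step (assumed form): initialize a new ELM as a weighted
    parameter average of existing ELMs, rather than training from scratch."""
    new_elm = copy.deepcopy(existing_elms[0])
    with torch.no_grad():
        for name, param in new_elm.named_parameters():
            stacked = torch.stack(
                [dict(m.named_parameters())[name] for m in existing_elms]
            )
            w = torch.tensor(weights, dtype=stacked.dtype)
            w = w.view(-1, *([1] * (stacked.dim() - 1)))
            param.copy_((w * stacked).sum(dim=0))
    return new_elm


def train_elm(elm, domain_dataloader, steps, lr=1e-4):
    """Train step: update the branched ELM on its target domain only,
    with no gradient synchronization against the other experts."""
    optimizer = torch.optim.AdamW(elm.parameters(), lr=lr)
    elm.train()
    for _, (input_ids, labels) in zip(range(steps), domain_dataloader):
        logits = elm(input_ids)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return elm


def merge_average(elms):
    """Merge option 1: collapse the expert set back into a single LM by
    uniform parameter averaging, for efficient inference."""
    return branch(elms, [1.0 / len(elms)] * len(elms))


def merge_ensemble(elms, input_ids, domain_posteriors):
    """Merge option 2: ensemble the experts at inference time by mixing
    their next-token distributions with per-domain weights."""
    probs = [F.softmax(m(input_ids), dim=-1) for m in elms]
    return sum(w * p for w, p in zip(domain_posteriors, probs))
```

Because each `train_elm` call touches only one expert and one domain shard, the experts can be trained on separate nodes with no inter-node communication, which is the source of the embarrassingly parallel training described above.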