While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours, using only 8 low-end 12GB GPUs. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
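As a rough illustration of what such a budget-constrained setup involves (and not the tuned recipe presented in this paper), the sketch below shows generic masked-language-model pretraining distributed over 8 GPUs with PyTorch DDP and mixed precision using HuggingFace Transformers. The corpus, model configuration, batch size, and learning rate are placeholder assumptions, not the paper's choices.

```python
# Minimal illustrative sketch -- NOT the authors' training code. Generic masked-LM
# pretraining on 8 GPUs with PyTorch DDP, mixed precision, and HuggingFace
# Transformers; corpus, batch size, and learning rate are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM(BertConfig()).cuda(local_rank)  # randomly initialized BERT-base
    model = DDP(model, device_ids=[local_rank])

    # Placeholder corpus: swap in a real tokenized pretraining dataset in practice.
    texts = ["replace this with real pretraining text"] * 4096
    enc = tokenizer(texts, truncation=True, max_length=128)
    dataset = [{"input_ids": ids} for ids in enc["input_ids"]]
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, collate_fn=collator)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    scaler = torch.cuda.amp.GradScaler()  # mixed precision helps fit 12GB GPUs

    model.train()
    for batch in loader:
        batch = {k: v.cuda(local_rank, non_blocking=True) for k, v in batch.items()}
        with torch.cuda.amp.autocast():
            loss = model(**batch).loss  # masked-LM cross-entropy over masked positions
        optimizer.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

if __name__ == "__main__":
    main()  # e.g.: torchrun --nproc_per_node=8 pretrain_sketch.py
```

Such a script would be launched with `torchrun --nproc_per_node=8`; the time-to-accuracy gains claimed above come from the software optimizations, design choices, and hyperparameter tuning described in the paper, not from this baseline loop.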