The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name $\textit{perplexity sampling}$ that enables the pre-training of language models in roughly half the number of steps and using one-fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget. Our models are available at this $\href{https://huggingface.co/bertin-project}{URL}$.
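As a rough illustration of the idea behind perplexity sampling, the sketch below subsamples a corpus by keeping documents with a probability that depends on their perplexity under an external language model. This is only a minimal, illustrative sketch, not the paper's exact recipe: the `perplexity` stub, the quartile thresholds, and the keep-probabilities are assumptions for the example; in practice one would score documents with a pretrained n-gram model (e.g., KenLM) and estimate the quartiles from a sample of the corpus.

```python
import random

def perplexity(text: str) -> float:
    # Placeholder for the sketch: in practice, score `text` with a pretrained
    # n-gram language model (e.g., KenLM) and convert its log-score into a
    # per-document perplexity.
    return float(len(text.split()))

def keep_probability(ppl: float, q1: float, q3: float,
                     keep_mid: float = 0.8, keep_tail: float = 0.2) -> float:
    # Stepwise weighting (illustrative values): favour documents whose
    # perplexity falls inside the interquartile range of the corpus and
    # down-weight the noisy extremes.
    return keep_mid if q1 <= ppl <= q3 else keep_tail

def perplexity_sample(docs, q1: float, q3: float, seed: int = 0):
    # Yield a subsample of `docs`, keeping each one with a probability
    # determined by its perplexity.
    rng = random.Random(seed)
    for doc in docs:
        if rng.random() < keep_probability(perplexity(doc), q1, q3):
            yield doc

# Usage: estimate q1/q3 from a sample of the corpus, then stream the subsample.
corpus = ["a short noisy line", "a longer, cleaner document with more words"]
subsample = list(perplexity_sample(corpus, q1=2.0, q3=6.0))
```

The thresholding function is a design choice; smoother weightings (for instance, a Gaussian-shaped weight centred on a target perplexity) are equally compatible with this scheme.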