Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find that the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality, offensive text. We evaluate our models on discriminative and generative tasks in French, comparing them to other state-of-the-art French and multilingual models, and reaching the state of the art on the abstractive summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large version are made publicly available.