Pre-training Large Language Models (LLMs) requires massive amounts of text data, and LLM performance typically correlates with the scale and quality of the pre-training datasets. This makes it challenging to build LLMs for smaller languages, such as the Nordic ones, where the availability of text corpora is limited. To facilitate the development of LLMs in the Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.