Huge corpora of textual data have long been known to be a crucial need for training deep models such as transformer-based ones. This need is even more acute in low-resource languages such as Farsi. We propose naab, the largest cleaned and ready-to-use open-source textual corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word NAAB, which means pure and high-grade. We also provide the raw version of the corpus, called naab-raw, and an easy-to-use preprocessor that can be employed by those who want to build a customized corpus.
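As a rough illustration of the kind of per-paragraph cleaning step such a preprocessor might include, the sketch below normalizes Arabic-script variants to their Farsi forms and collapses whitespace. The function name and the normalization rules here are illustrative assumptions, not the actual naab pipeline.

```python
import re

# Illustrative mapping: Arabic Yeh/Kaf to their Farsi counterparts.
# (A sketch only; the real naab preprocessor may apply different rules.)
ARABIC_TO_FARSI = {
    "\u064a": "\u06cc",  # Arabic Yeh -> Farsi Yeh
    "\u0643": "\u06a9",  # Arabic Kaf -> Farsi Kaf
}

def clean_paragraph(text: str) -> str:
    """Normalize character variants and whitespace in one paragraph."""
    for src, dst in ARABIC_TO_FARSI.items():
        text = text.replace(src, dst)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_paragraph("  \u064a\u06a9   \u0645\u062a\u0646  "))
```

Applying such a function over every paragraph of a raw dump, then deduplicating and filtering, is the usual shape of a corpus-cleaning pipeline.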