Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and other Latin-script languages perform suboptimally on morphologically rich languages such as Arabic, producing inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline that addresses Arabic-specific orthographic variation, including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility than unnormalized baselines (1.199 vs. 1.35 tokens/word). Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective unfreezing of transformer layers. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
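To make the normalization stage concrete, the following is a minimal sketch of the kind of Arabic-specific pass the abstract describes (Alif unification, diacritic stripping, and Arabic-Indic digit mapping). The character tables and rule order here are illustrative assumptions, not the released AraToken pipeline.

```python
# Minimal sketch of an Arabic normalization pass of the kind described above.
# The exact rule set in AraToken may differ; these mappings are illustrative.
import re

ALIF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ٱ": "ا"})   # unify Alif forms
ARABIC_INDIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")          # map Arabic-Indic digits
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")                        # tanween, harakat, dagger Alif

def normalize_arabic(text: str) -> str:
    text = text.translate(ALIF_VARIANTS)
    text = text.translate(ARABIC_INDIC_DIGITS)
    return DIACRITICS.sub("", text)

print(normalize_arabic("أَهْلاً ١٢٣"))  # -> "اهلا 123"
```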
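The vocabulary-extension step of LEP can likewise be illustrated with a hedged sketch of mean subtoken initialization using the Hugging Face transformers API. The example tokens and the handling of the output embedding are assumptions for illustration, not the released LEP code.

```python
# Hedged sketch of mean-subtoken initialization when extending the base vocabulary,
# assuming the Hugging Face transformers API. Example tokens are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen3-0.6B"                        # base checkpoint named in the abstract
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

new_tokens = ["مدرسة", "يكتبون"]                # hypothetical tokens from the Arabic vocabulary

# Record how the *original* tokenizer splits each new token before extending it.
subtoken_ids = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

emb_in = model.get_input_embeddings().weight.data
emb_out = model.get_output_embeddings()
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Initialize the new embedding row to the mean of its old subtoken embeddings.
        emb_in[new_id] = emb_in[subtoken_ids[tok]].mean(dim=0)
        # If the output head is untied, initialize its new row the same way.
        if emb_out is not None and emb_out.weight.data.data_ptr() != emb_in.data_ptr():
            emb_out.weight.data[new_id] = emb_out.weight.data[subtoken_ids[tok]].mean(dim=0)
```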