IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs

Tokenizers play a crucial role in determining the performance, training efficiency, and inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging because of diverse scripts and rich morphological variation. While subword methods such as Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present IndicSuperTokenizer, a tokenizer for Indic multilingual LLMs that combines subword and multi-word tokenization with language-specific pre-tokenization, yielding more linguistically aligned tokens and a new state of the art in fertility score. Evaluated across English, 22 Indian languages, and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to a 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We also present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
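The abstract reports its gains in terms of fertility score without defining it. Fertility is commonly taken to be the average number of tokens a tokenizer emits per word, so lower is better. The sketch below illustrates that common definition and a relative-reduction calculation; it is not the paper's evaluation code. The whitespace-based word count, the two toy tokenizers, and the function names `fertility` and `relative_improvement` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (lower is better).

    Assumption: words are counted by splitting on whitespace, which is a
    simplification and may differ from the word segmentation used in the paper.
    """
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage reduction in fertility of `candidate` relative to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical toy tokenizers, used only to exercise the metric.
def char_tokenize(text: str) -> List[str]:
    # One token per non-whitespace character: a high-fertility extreme.
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    # One token per whitespace-delimited word: fertility of exactly 1.
    return text.split()


if __name__ == "__main__":
    corpus = ["hello world", "नमस्ते दुनिया"]
    f_char = fertility(corpus, char_tokenize)
    f_word = fertility(corpus, word_tokenize)
    print(f"char-level fertility: {f_char:.2f}")
    print(f"word-level fertility: {f_word:.2f}")
    print(f"relative improvement: {relative_improvement(f_char, f_word):.1f}%")
```

Under this reading, a statement such as "improves the average fertility score by 39.5% over LLaMA4" would correspond to a 39.5% reduction in tokens per word relative to the baseline tokenizer, averaged over the evaluated languages; the exact averaging scheme is not stated in the abstract.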
