Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods employed by prominent models such as BERT and GPT. However, the impact of tokenization can differ for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e., their outputs range from the smallest character pieces to the surface forms of words, and include a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer achieves performance competitive with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of the Morphological and Word-level tokenizers more than that of the de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for the other tokenizers to obtain a reasonable trade-off between model size and performance.
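To make the granularity levels concrete, the sketch below contrasts character-level, word-level, and BPE-style subword tokenization on a Turkish word. The merge rules are hypothetical, chosen only to illustrate how BPE-style merging can recover morpheme-like subwords; they are not the merges learned in this work.

```python
# Hypothetical sketch of tokenization granularities (not the paper's tokenizers).

def char_level(word):
    # Smallest pieces: individual characters.
    return list(word)

def word_level(text):
    # Surface forms: whitespace-split words.
    return text.split()

def bpe_like(word, merges):
    # Greedy subword merging: repeatedly join the first adjacent pair
    # found in the (hypothetical) learned merge set.
    tokens = list(word)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            if (tokens[i], tokens[i + 1]) in merges:
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
                changed = True
                break
    return tokens

# "evlerden" = ev (house) + ler (plural) + den (from), illustrating
# how suffixation builds words in Turkish.
word = "evlerden"
merges = {("e", "v"), ("l", "e"), ("le", "r"), ("d", "e"), ("de", "n")}

print(char_level(word))        # ['e', 'v', 'l', 'e', 'r', 'd', 'e', 'n']
print(bpe_like(word, merges))  # ['ev', 'ler', 'den']
print(word_level("evlerden geldik"))  # ['evlerden', 'geldik']
```

With well-chosen merges, the subword segmentation coincides with the morpheme boundaries, which is exactly the behavior a Morphological-level tokenizer produces by design rather than by statistics.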