One of the challenges in finetuning pretrained language models (PLMs) is that their tokenizers are optimized for the language(s) they were pretrained on, but brittle when confronted with previously unseen variation in the data. This can, for instance, be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer yields meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures of the divergence between the tokenization of the source and target data, and how these can be adjusted by manipulating the tokenization during finetuning. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor of model performance on the target data.
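To make the split word ratio concrete, the following is a minimal Python sketch of how such a ratio can be computed with a subword tokenizer. This is an illustrative assumption, not the paper's evaluation code: the multilingual BERT checkpoint and the German/Swiss German word lists are placeholders, and pre-tokenized word lists are assumed as input.

```python
from transformers import AutoTokenizer

def split_word_ratio(words: list[str], tokenizer) -> float:
    """Fraction of words that the tokenizer splits into more than one subword."""
    n_split = sum(1 for w in words if len(tokenizer.tokenize(w)) > 1)
    return n_split / len(words)

# Illustrative word lists: standard German as the source, a Swiss German
# rendering as the (non-standardized) target. These are assumptions for
# demonstration, not the paper's data.
source_words = ["das", "ist", "ein", "kleines", "Beispiel"]
target_words = ["das", "isch", "es", "chlises", "Bischpil"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

ratio_src = split_word_ratio(source_words, tokenizer)
ratio_tgt = split_word_ratio(target_words, tokenizer)

# The split word ratio difference: how much more (or less) often
# target-side words are fragmented relative to the source side.
print(f"source split ratio: {ratio_src:.2f}")
print(f"target split ratio: {ratio_tgt:.2f}")
print(f"split word ratio difference: {abs(ratio_src - ratio_tgt):.2f}")
```

Computed over the actual source and target corpora rather than toy word lists, the absolute difference between the two ratios is the quantity the abstract identifies as the strongest predictor of zero-shot performance on the target variety.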