Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, the assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection process can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on an initial data set, then examine their performance consistency when introducing new randomly sampled data sets of the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set and does not necessarily rely heavily on the data set size. Among the characteristics we studied, the morpheme-overlap ratio between the training and test sets and the ratio of their average numbers of morphemes per word are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets of different sizes in order to make more responsible claims about model evaluation.
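As an illustration only (not the authors' implementation), the sketch below shows one plausible way to compute the two data set characteristics named above, assuming each split is represented as a list of words, where each word is a list of its gold-standard morphemes; the function names and the toy data are hypothetical.

```python
# Hedged sketch: computing the two data set characteristics highlighted above,
# assuming each split is a list of segmented words, e.g. [["un", "kind", "ness"], ...].
# This is illustrative code, not the paper's actual pipeline.

def morpheme_overlap_ratio(train, test):
    """Fraction of morpheme types in the test set that also occur in training."""
    train_types = {m for word in train for m in word}
    test_types = {m for word in test for m in word}
    if not test_types:
        return 0.0
    return len(test_types & train_types) / len(test_types)

def avg_morphemes_per_word(split):
    """Mean number of morphemes per word in a split."""
    return sum(len(word) for word in split) / len(split)

def morphemes_per_word_ratio(train, test):
    """Ratio of the average number of morphemes per word, train vs. test."""
    return avg_morphemes_per_word(train) / avg_morphemes_per_word(test)

# Toy example (hypothetical segmentations):
train = [["un", "kind", "ness"], ["kind", "ly"], ["walk", "ed"]]
test = [["un", "walk", "able"], ["kind", "ness"]]
print(morpheme_overlap_ratio(train, test))    # 0.8
print(morphemes_per_word_ratio(train, test))  # ~0.93
```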