Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages, with little linguistic diversity. We argue that this makes current multilingual evaluation practices unreliable and fails to give a full picture of MMLM performance across the linguistic landscape. We propose that recent work on performance prediction for NLP tasks can serve as a potential remedy for benchmarking in multilingual NLP, by using features related to data and language typology to estimate an MMLM's performance on different languages. We compare performance prediction with translating test data in a case study on four different multilingual datasets, and observe that these methods can provide reliable performance estimates that are often on par with translation-based approaches, without incurring any additional translation or evaluation costs.
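As a minimal illustration of the general idea (not the exact setup used in the case study), the sketch below fits a regressor on languages for which evaluation data exists, using data-related and typological features, and then estimates the score for a language without any test set. All feature values, language examples, and hyperparameters are hypothetical placeholders.

# Sketch: predicting an MMLM's task score for an unseen language from
# data- and typology-related features. Values below are illustrative only.
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Each row: [log pretraining tokens, syntactic distance to the source language,
#            phonological distance, log task-specific training examples]
X_train = np.array([
    [9.2, 0.10, 0.05, 4.1],   # e.g. a high-resource language
    [8.7, 0.35, 0.20, 3.9],
    [7.9, 0.55, 0.40, 3.2],
    [8.4, 0.30, 0.25, 3.8],
])
y_train = np.array([0.78, 0.69, 0.55, 0.66])  # observed task scores (hypothetical)

# Fit the performance predictor on languages that do have evaluation data.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)

# Estimate performance for a language with no evaluation data at all.
X_new = np.array([[7.5, 0.60, 0.45, 2.9]])
print(f"Predicted score: {model.predict(X_new)[0]:.3f}")

In practice the features could come from typological databases and corpus statistics, and any standard regressor could replace the gradient-boosted trees shown here; the key point is that, once fit, the predictor requires neither translated test sets nor additional evaluation runs.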