This paper introduces ScandEval, a Scandinavian benchmarking platform that can benchmark any pretrained model on four different tasks in the Scandinavian languages. The datasets used in two of the tasks, linguistic acceptability and question answering, are new. We develop and release a Python package and command-line interface, scandeval, which can benchmark any model that has been uploaded to the Hugging Face Hub, with reproducible results. Using this package, we benchmark more than 100 Scandinavian or multilingual models, present the results in an interactive online leaderboard, and provide an analysis of them. The analysis shows substantial cross-lingual transfer among the Mainland Scandinavian languages (Danish, Swedish and Norwegian), but only limited cross-lingual transfer between the Mainland Scandinavian languages and the Insular Scandinavian languages (Icelandic and Faroese). The benchmarking results also show that the investment in language technology in Norway, Sweden and Denmark has led to language models that outperform massively multilingual models such as XLM-RoBERTa and mDeBERTaV3. We release the source code for both the package and the leaderboard.