The growing availability of compute and data for training ever-larger language models (LMs) increases the demand for robust methods of benchmarking the true progress of LM training. Recent years have witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, and KILT have become de facto standard tools for comparing large language models. Following the trend of replicating GLUE for other languages, the KLEJ benchmark has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note a gap between the number of tasks evaluated by benchmarks for resource-rich English and Chinese and the number evaluated for the rest of the world's languages. We introduce LEPISZCZE (the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind: including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we conduct 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the existing Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE itself, we provide insights and experiences gained while creating the benchmark for Polish as a blueprint for designing similar benchmarks for other low-resourced languages.