We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks, together with an updated and expanded version of the original FIN-bench, into a single, consistently formatted collection. The suite covers multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the HuggingFace Datasets format and include both cloze and multiple-choice prompt formulations, with five variants per task. For machine-translated resources such as GoldenSwag and XED, we incorporate human annotation or review. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute four criteria: monotonicity, signal-to-noise ratio, non-random performance, and model ordering consistency. We retain only tasks that satisfy all four criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
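The task-selection step can be illustrated with a minimal sketch. The abstract names the four criteria but does not define them, so every definition below is an assumption made for illustration: monotonicity is taken as the Spearman correlation between checkpoint index and score, signal-to-noise as the mean over the last few checkpoints divided by their standard deviation, non-random performance as a margin over the chance baseline, and ordering consistency as the fraction of earlier checkpoints where each model pair is ranked as at the final checkpoint. All function names, thresholds, and scores are hypothetical and are not taken from the FIN-bench-v2 code.

```python
from itertools import combinations
from statistics import mean, pstdev


def _ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def monotonicity(curve):
    """Assumed definition: Spearman correlation of checkpoint index vs. score."""
    rx, ry = _ranks(list(range(len(curve)))), _ranks(curve)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def signal_to_noise(curve, tail=3):
    """Assumed definition: mean of the last `tail` scores over their std."""
    last = curve[-tail:]
    noise = pstdev(last)
    return float("inf") if noise == 0 else mean(last) / noise


def above_random(curve, random_baseline, margin=0.02):
    """Assumed definition: final score beats chance by at least `margin`."""
    return curve[-1] >= random_baseline + margin


def ordering_consistency(curves):
    """Fraction of (model pair, earlier checkpoint) cases ranked as at the end."""
    agree = total = 0
    for a, b in combinations(curves, 2):
        final_gap = curves[a][-1] - curves[b][-1]
        for sa, sb in zip(curves[a][:-1], curves[b][:-1]):
            total += 1
            agree += (sa - sb) * final_gap > 0
    return agree / total if total else 1.0


def keep_task(curves, random_baseline, min_mono=0.5, min_snr=5.0, min_order=0.75):
    """Keep a task only if every criterion passes (thresholds are hypothetical)."""
    per_model_ok = all(
        monotonicity(c) >= min_mono
        and signal_to_noise(c) >= min_snr
        and above_random(c, random_baseline)
        for c in curves.values()
    )
    return per_model_ok and ordering_consistency(curves) >= min_order


# Made-up accuracy curves over five checkpoints for two hypothetical models.
good_task = {
    "model_a": [0.26, 0.30, 0.35, 0.41, 0.44],
    "model_b": [0.25, 0.28, 0.31, 0.36, 0.40],
}
flat_task = {
    "model_a": [0.25, 0.26, 0.24, 0.25, 0.25],
    "model_b": [0.26, 0.24, 0.25, 0.26, 0.24],
}
print(keep_task(good_task, random_baseline=0.25))
print(keep_task(flat_task, random_baseline=0.25))
```

Requiring all criteria at once mirrors the abstract's statement that only tasks satisfying every criterion are retained. A task whose curves hover at chance level, as in the second example, is rejected even if its scores are stable.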