A Diagnostic Benchmark for Sweden-Related Factual Knowledge

Many Swedish benchmarks are translations of US-centric benchmarks and are therefore unsuitable for testing knowledge that is particularly relevant, or even specific, to Sweden. We therefore introduce a manually written question-answering benchmark that targets Sweden-related personalities and events, many of which receive very limited coverage in international media. Our annotators drew inspiration from a popular radio program featuring public figures from culture and media, as well as from major sports events in Sweden. The dataset can be used to measure factual recall across models of varying sizes and degrees of Swedish coverage and, because it contains English translations, to probe cross-lingual factual consistency. Using the dataset, we find that smaller models with stronger Swedish coverage perform comparably to a multilingual model three times larger in recalling Sweden-related facts. We also observe that continued pre-training on Swedish generally improves factual knowledge but also causes part of the previously known information to be forgotten. These results demonstrate the dataset's potential as a diagnostic tool for studying knowledge retention in multilingual models and during language adaptation.

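As an illustration of the kind of diagnostic the abstract describes, the following minimal sketch compares per-question correctness before and after continued pre-training and splits the questions into retained, gained, and forgotten facts. It is not the authors' evaluation code: the function names, the question IDs, and the assumption that each answer has already been scored as correct or incorrect are all hypothetical.

```python
from typing import Dict, Set


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of questions answered correctly (0.0 for an empty result set)."""
    return sum(results.values()) / len(results) if results else 0.0


def knowledge_shift(
    base: Dict[str, bool], adapted: Dict[str, bool]
) -> Dict[str, Set[str]]:
    """Compare per-question correctness of a base and an adapted model.

    Both arguments map a (hypothetical) question ID to whether the model
    answered that question correctly. Only questions present in both
    result sets are compared.
    """
    shared = base.keys() & adapted.keys()
    return {
        # Correct before and after adaptation.
        "retained": {q for q in shared if base[q] and adapted[q]},
        # Wrong before, correct after: knowledge gained.
        "gained": {q for q in shared if not base[q] and adapted[q]},
        # Correct before, wrong after: knowledge forgotten.
        "forgotten": {q for q in shared if base[q] and not adapted[q]},
    }


# Hypothetical scored results for four questions (illustrative values only).
base_results = {"q1": True, "q2": False, "q3": True, "q4": False}
adapted_results = {"q1": True, "q2": True, "q3": False, "q4": True}

shift = knowledge_shift(base_results, adapted_results)
print(accuracy(base_results), accuracy(adapted_results))
print({name: sorted(ids) for name, ids in shift.items()})
```

Reporting the gained and forgotten sets separately, rather than only the change in accuracy, is what makes forgetting visible: a model can improve overall while still losing some facts it previously recalled.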