In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open-source models on the human-translated benchmark. We also explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt, specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human-translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Furthermore, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and they highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.