Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. A held-out panel of LLM judges, including Claude 3.5 Haiku, evaluated solution quality using a comparative framework. Results show a consistent quality gap: English solutions were rated highest, while Arabic solutions were often ranked lower. These findings highlight persistent linguistic bias and the need for more equitable multilingual AI systems in education.
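To make the described pipeline concrete, the following is a minimal sketch of the generate-solve-judge loop, assuming a pairwise comparative judging protocol across languages. The function names (solve_with, judge) and the data layout are hypothetical placeholders standing in for the providers' chat APIs and the paper's actual evaluation code.

```python
# Hypothetical sketch of the multilingual solve-and-judge loop.
# solve_with() and judge() are placeholders to be wired to the respective
# provider APIs (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus, Claude 3.5 Haiku);
# they are not the paper's actual implementation.

LANGUAGES = ["en", "de", "ar"]
SOLVERS = ["gpt-4o-mini", "gemini-2.5-flash", "qwen-plus"]


def solve_with(model: str, exercise: str, lang: str) -> str:
    """Placeholder: ask `model` for a step-by-step solution to `exercise`,
    written in language `lang`."""
    raise NotImplementedError("wire this to the provider's chat API")


def judge(solution_a: str, solution_b: str) -> int:
    """Placeholder comparative judge (e.g. Claude 3.5 Haiku):
    +1 if solution_a is better, -1 if solution_b is better, 0 for a tie."""
    raise NotImplementedError("wire this to the judge model's API")


def evaluate(exercises: dict[str, dict[str, str]]) -> dict[tuple[str, str, str], int]:
    """Pairwise language comparison per solver model.

    `exercises` maps an exercise id to its translations {lang: text}.
    Returns accumulated judge scores keyed by (model, lang_a, lang_b).
    """
    scores: dict[tuple[str, str, str], int] = {}
    for ex_id, translations in exercises.items():
        for model in SOLVERS:
            # Solve the same exercise in every language with the same model.
            solutions = {
                lang: solve_with(model, translations[lang], lang)
                for lang in LANGUAGES
            }
            # Compare every pair of languages with the held-out judge.
            for i, lang_a in enumerate(LANGUAGES):
                for lang_b in LANGUAGES[i + 1:]:
                    key = (model, lang_a, lang_b)
                    scores[key] = scores.get(key, 0) + judge(
                        solutions[lang_a], solutions[lang_b]
                    )
    return scores
```

Under this sketch, a positive score for (model, "en", "ar") would indicate that the judge preferred the English solutions of that model over its Arabic ones across the exercise set, matching the kind of gap the abstract reports.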