The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations focus predominantly on English-language models, neglecting the linguistic and cultural nuances of other languages. Korean in particular, with its distinct syntax, rich morphology, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following ability. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks, which focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insight into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster research on culturally and linguistically inclusive LLM development and to inspire similar efforts for other underrepresented languages.