We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that, while the best models surpass the average expert score, they still fall short of the 95th percentile of experts.
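
To make the meta-evaluation idea concrete, the following is a minimal Python sketch of the kind of judge-expert correlation measurement the abstract describes. The dimension names, the 0-10 scale, the unweighted-mean aggregation, and all score values below are hypothetical assumptions for illustration, not the paper's actual rubric or data.

    # Hypothetical sketch: how well do an LLM-judge's scores track human expert
    # scores across a three-dimensional rubric? All names and data are assumed.
    from scipy.stats import spearmanr

    # Assumed rubric dimensions, each scored 0-10 per answer.
    DIMENSIONS = ("statute_citations", "fact_citations", "analysis")

    expert_scores = [
        {"statute_citations": 8, "fact_citations": 7, "analysis": 9},
        {"statute_citations": 4, "fact_citations": 5, "analysis": 3},
        {"statute_citations": 9, "fact_citations": 8, "analysis": 8},
        {"statute_citations": 2, "fact_citations": 3, "analysis": 4},
    ]
    judge_scores = [
        {"statute_citations": 7, "fact_citations": 7, "analysis": 8},
        {"statute_citations": 5, "fact_citations": 4, "analysis": 4},
        {"statute_citations": 9, "fact_citations": 9, "analysis": 7},
        {"statute_citations": 3, "fact_citations": 2, "analysis": 4},
    ]

    def overall(score: dict) -> float:
        # Aggregate the three dimensions into one score (unweighted mean here;
        # the paper may weight dimensions differently).
        return sum(score[d] for d in DIMENSIONS) / len(DIMENSIONS)

    # Rank correlation between judge and expert overall scores: higher values
    # indicate better judge-expert alignment, the quantity a meta-evaluation
    # of LLM-judges would track.
    rho, p_value = spearmanr(
        [overall(s) for s in judge_scores],
        [overall(s) for s in expert_scores],
    )
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")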