Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems

As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable enough for real-time deployment and reliable enough for critical decision-making. Human evaluation is reliable, but it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand, a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables fully adaptive evaluation: for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their predicted reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable, and trustworthy evaluation systems for modern LLMs in high-stakes domains.
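The per-data-point selection and weighting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that the "optimal jury" is simply the top `k` judges by predicted reliability, and that the reliability values are already available as plain numbers (in the paper they would come from the trained reliability predictors). The names `select_jury`, `jury_score`, and the judge labels are hypothetical.

```python
from typing import Dict, List, Tuple


def select_jury(reliability: Dict[str, float], k: int) -> List[str]:
    """Return the k judges with the highest predicted reliability.

    Assumption: top-k selection stands in for the "optimal jury";
    the abstract does not specify the exact selection rule.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(reliability, key=lambda name: reliability[name], reverse=True)
    return ranked[:k]


def jury_score(
    scores: Dict[str, float],
    reliability: Dict[str, float],
    k: int,
) -> Tuple[float, List[str]]:
    """Aggregate the selected judges' scores, weighted by reliability.

    scores:      one score per judge for a single data point.
    reliability: predicted probability that each judge agrees with
                 human experts on this data point (hypothetical input).
    """
    # Only consider judges that actually produced a score.
    jury = select_jury({name: reliability[name] for name in scores}, k)
    total_weight = sum(reliability[name] for name in jury)
    if total_weight <= 0:
        # Degenerate case: fall back to an unweighted mean.
        return sum(scores[name] for name in jury) / len(jury), jury
    weighted = sum(reliability[name] * scores[name] for name in jury)
    return weighted / total_weight, jury


# Example for one data point with three hypothetical judges.
scores = {"judge_a": 4.0, "judge_b": 2.0, "judge_c": 5.0}
reliability = {"judge_a": 0.9, "judge_b": 0.2, "judge_c": 0.6}
final_score, jury = jury_score(scores, reliability, k=2)
print(jury, round(final_score, 2))
```

In this example the least reliable judge, `judge_b`, is excluded, and the final score is pulled toward `judge_a` because it carries the larger weight. Because the reliability values are recomputed for every input, the jury composition can change from one data point to the next, which is what distinguishes this scheme from a static jury.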
