New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether they should switch to a new model. While human evaluation remains the gold standard, it is costly and does not scale. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (True Positive Rate of 96%), they are remarkably poor at identifying invalid ones (True Negative Rate below 25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods such as majority voting can help, we show that they are insufficient. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground-truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
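As an illustration only (the abstract does not spell out the exact veto rule), a minority-veto aggregation over per-validator verdicts might look like the following sketch. The function name, the `veto_threshold` parameter, and the handling of missing votes are assumptions made for this example, not the paper's definitive implementation; the idea is that, because positively biased judges rarely say "invalid", even a small minority of invalid votes is treated as decisive.

```python
from typing import Dict, Optional

def minority_veto(votes: Dict[str, Optional[bool]], veto_threshold: int = 2) -> bool:
    """Aggregate validator verdicts (True = valid, False = invalid, None = missing).

    Hypothetical sketch: since LLM judges show high TPR but very low TNR,
    an "invalid" vote carries far more information than a "valid" one.
    The output is rejected as soon as `veto_threshold` validators flag it,
    no matter how many others accept it. Missing votes are ignored, which
    keeps the rule usable when some validators fail to respond.
    """
    invalid_votes = sum(1 for v in votes.values() if v is False)
    return invalid_votes < veto_threshold  # True => accept the output as valid

# Example: two vetoes outvote the accepting majority and the missing vote.
votes = {"judge_a": True, "judge_b": False, "judge_c": None, "judge_d": False}
print(minority_veto(votes))  # False: the output is rejected
```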