New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether they should switch to a new model. While human evaluation remains the gold standard, it is costly and does not scale. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (True Positive Rate of 96%), they are remarkably poor at identifying invalid ones (True Negative Rate below 25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods such as majority voting can help, we show that they are insufficient. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground-truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
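As an illustration only (the abstract does not spell out the exact veto rule), a minority-veto aggregation over per-validator verdicts might look like the following sketch. The function name, the `veto_threshold` parameter, and the handling of missing votes are assumptions made for this example, not the paper's definitive implementation; the idea is that, because positively biased judges rarely say "invalid", even a small minority of invalid votes is treated as decisive.

```python
from typing import Dict, Optional

def minority_veto(votes: Dict[str, Optional[bool]], veto_threshold: int = 2) -> bool:
    """Aggregate validator verdicts (True = valid, False = invalid, None = missing).

    Hypothetical sketch: since LLM judges show high TPR but very low TNR,
    an "invalid" vote carries far more information than a "valid" one.
    The output is rejected as soon as `veto_threshold` validators flag it,
    no matter how many others accept it. Missing votes are ignored, which
    keeps the rule usable when some validators fail to respond.
    """
    invalid_votes = sum(1 for v in votes.values() if v is False)
    return invalid_votes < veto_threshold  # True => accept the output as valid

# Example: two vetoes outvote the accepting majority and the missing vote.
votes = {"judge_a": True, "judge_b": False, "judge_c": None, "judge_d": False}
print(minority_veto(votes))  # False: the output is rejected
```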