The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round-r gap concentrates around the true difference and the probability of mis-ranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT-Bench, AlignBench, and AUTO-J, showing state-of-the-art agreement with human judgments (accuracy and Cohen's kappa), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost-accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost-aware LLM evaluation.
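The abstract only names the two protocols; a minimal sketch may help fix ideas. The following is an illustration under our own assumptions, not the paper's implementation: it assumes a generic `ask(prompt) -> str` wrapper around an LLM call, hypothetical prompts, a whitespace token proxy, and simple averaging as the aggregator (the paper's anonymization and jury machinery are omitted).

```python
# Sketch of D3's two protocols under assumed interfaces; all prompts,
# names, and the averaging aggregator are illustrative, not the paper's.
import re
from typing import Callable

Ask = Callable[[str], str]

def parse_gap(reply: str) -> float:
    """Extract the first signed number from a judge reply (assumed format)."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def more(ask: Ask, question: str, answer_a: str, answer_b: str, k: int = 3) -> float:
    """MORE: k advocates each defend both answers in a single round;
    per-advocate judged score gaps are averaged (our choice of aggregator)."""
    gaps = []
    for i in range(k):  # conceptually parallel; sequential here for simplicity
        defense_a = ask(f"Advocate {i}: argue that answer A best addresses {question!r}. A: {answer_a}")
        defense_b = ask(f"Advocate {i}: argue that answer B best addresses {question!r}. B: {answer_b}")
        verdict = ask(
            "Judge: report score(A) - score(B) on [-10, 10].\n"
            f"Defense of A: {defense_a}\nDefense of B: {defense_b}"
        )
        gaps.append(parse_gap(verdict))
    return sum(gaps) / k  # positive => A preferred

def samre(ask: Ask, question: str, answer_a: str, answer_b: str,
          token_budget: int = 4000, eps: float = 0.5, max_rounds: int = 5) -> float:
    """SAMRE with budgeted stopping: one advocate refines its case each round;
    stop when the judged gap stabilizes (|gap_r - gap_{r-1}| < eps) or the
    token budget is spent. Whitespace splitting is a crude token proxy."""
    transcript, spent, prev_gap, gap = "", 0, None, 0.0
    for r in range(max_rounds):
        argument = ask(f"Round {r}: refine the comparative case for A vs. B on {question!r}.{transcript}")
        spent += len(argument.split())
        gap = parse_gap(ask(f"Judge: report score(A) - score(B) given:\n{argument}"))
        if (prev_gap is not None and abs(gap - prev_gap) < eps) or spent >= token_budget:
            break  # convergence or budget stop
        transcript += f"\nRound {r} argument: {argument}"
        prev_gap = gap
    return gap

if __name__ == "__main__":
    stub: Ask = lambda prompt: "gap: 3"  # stand-in for a real model call
    print(more(stub, "Which answer is better?", "answer A", "answer B"))
    print(samre(stub, "Which answer is better?", "answer A", "answer B"))
```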
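The theoretical claims can also be stated compactly. One plausible formalization, consistent with the abstract but using notation ($\Delta^{*}$, $\hat{\Delta}_r$, $\sigma$) that we introduce here rather than take from the paper:

```latex
% Hedged formalization of the abstract's claims; the notation is ours,
% not necessarily the paper's.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Let $\Delta^{*} \neq 0$ be the true quality gap between two answers and
$\hat{\Delta}_r$ the judged score gap after $r$ debate rounds. The
concentration claim says the posterior of $\hat{\Delta}_r$ tightens
around $\Delta^{*}$:
\[
  \mathbb{E}\bigl[\hat{\Delta}_r\bigr] \to \Delta^{*},
  \qquad
  \operatorname{Var}\bigl(\hat{\Delta}_r\bigr) \to 0
  \quad (r \to \infty),
\]
so the mis-ranking probability vanishes, e.g.\ by a Chebyshev-style bound:
\[
  \Pr\bigl[\operatorname{sign}(\hat{\Delta}_r) \neq \operatorname{sign}(\Delta^{*})\bigr]
  \le \Pr\bigl[\lvert \hat{\Delta}_r - \Delta^{*} \rvert \ge \lvert \Delta^{*} \rvert\bigr]
  \le \frac{\operatorname{Var}(\hat{\Delta}_r)
        + \bigl(\mathbb{E}[\hat{\Delta}_r]-\Delta^{*}\bigr)^{2}}{(\Delta^{*})^{2}}
  \to 0 .
\]
For MORE, one reading of the separation gain: if the $k$ advocates yield
i.i.d.\ gaps $\hat{\Delta}^{(i)}$ with mean $\Delta^{*}$ and variance
$\sigma^{2}$, the aggregate $\bar{\Delta}_k = \tfrac{1}{k}\sum_{i=1}^{k}
\hat{\Delta}^{(i)}$ satisfies $\operatorname{Var}(\bar{\Delta}_k) =
\sigma^{2}/k$, so the separation-to-noise ratio
$\lvert\Delta^{*}\rvert \big/ (\sigma/\sqrt{k})$ grows as $\sqrt{k}$.
\end{document}
```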