Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.