Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), i.e., iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating errors across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, rather than repeatedly refining a single output over time as in prior work. By distributing compute across stages and pruning low-quality branches early with process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent system that generates visually insightful charts and reports from a given dataset, and we design a reliable LLM-based judge aligned with human experts (Kendall's τ = 0.55). Under a fixed compute budget, Selective TTS improves insight quality, raising the mean score from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
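To make the stage-wise pruning idea concrete, the sketch below shows one way such selective refinement could be organized: each pipeline stage expands the surviving branches into several candidates, a process-specific judge scores them, and only the top-k branches proceed. This is a minimal illustration under assumed interfaces (the stage expanders, per-stage judges, and top-k strategy are hypothetical stand-ins), not the paper's implementation.

```python
# Minimal sketch of stage-wise selective refinement (illustrative only).
# Assumptions: each stage function expands a branch into several candidate
# outputs, and each stage has its own judge returning a scalar quality score.

from typing import Any, Callable, List


def selective_tts(
    initial_input: Any,
    stages: List[Callable[[Any, int], List[Any]]],  # branch, budget -> candidates
    judges: List[Callable[[Any], float]],           # one process-specific judge per stage
    branch_factor: int = 4,
    keep_top_k: int = 2,
) -> Any:
    """Spread the inference budget across pipeline stages: expand each
    surviving branch, score candidates with that stage's judge, and prune
    low-quality branches before moving to the next stage."""
    branches = [initial_input]
    for stage_fn, judge in zip(stages, judges):
        candidates: List[Any] = []
        for branch in branches:
            # Spend part of the compute budget expanding this branch.
            candidates.extend(stage_fn(branch, branch_factor))
        # Prune early with the stage-specific judge so low-quality branches
        # do not propagate (and accumulate error) into later stages.
        candidates.sort(key=judge, reverse=True)
        branches = candidates[:keep_top_k]
    # Return the best surviving branch under the final stage's judge.
    return max(branches, key=judges[-1])
```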