Majority voting has proven effective for closed-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k sampling. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by 8.28% for DeepCoder-14B-Preview and by 7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
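The merging step described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the vLLM or SGLang APIs; it only assumes that, at one synchronization point, each of the K traces supplies a logit vector over the same vocabulary. The function names (`merge_logits`, `top_p_filter`, `merged_next_token`) are hypothetical, and the abstract does not specify where synchronization points fall or how traces continue after a merge.

```python
import math
import random


def merge_logits(trace_logits):
    """Average next-token logits over K traces (K lists of equal length)."""
    k = len(trace_logits)
    if k == 0:
        raise ValueError("need at least one trace")
    vocab = len(trace_logits[0])
    if any(len(row) != vocab for row in trace_logits):
        raise ValueError("all traces must share one vocabulary size")
    return [sum(row[j] for row in trace_logits) / k for j in range(vocab)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def merged_next_token(trace_logits, top_p=1.0, rng=None):
    """Pick one token id from the averaged logits.

    Assumption: greedy choice when rng is None, otherwise Top-p sampling.
    """
    probs = top_p_filter(softmax(merge_logits(trace_logits)), top_p)
    if rng is None:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=probs)[0]


# Toy example: K = 3 traces over a 3-token vocabulary.
traces = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [2.0, 0.0, 1.0]]
token = merged_next_token(traces)
sampled = merged_next_token(traces, top_p=0.5, rng=random.Random(0))
```

Because the merge happens in logit space before sampling, a standard filter such as Top-p can be applied to the merged distribution unchanged, which is one way to read the compatibility claim in the abstract.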